Vancouver Python Day

On Saturday 17 November I attended Vancouver Python Day. There were many interesting talks; the highlight was one about using Python to build an internet radio on top of a Raspberry Pi. I am new to Python, so it was great to be introduced to concepts such as decorators and metaprogramming. They are still a bit over my head, but now I know what to aim for. The most useful talk for me was about IPython and the IPython Notebook. These are great tools for building a set of interactive notes while learning Python.

The start of my industrial career!!!

It has been a while since I last posted on my blog. That is because I have taken my first steps toward becoming a data scientist: in October I started an internship as a data scientist at Metafor Software. At the same time we are going to apply for the NSERC Industrial R&D Fellowship. If the fellowship is approved, I will be hired for a full-time position!

Metafor Software specializes in anomaly detection and brought me on to build analytics tools for time-series analysis. The main project I will be working on is automated monitoring: building unsupervised machine learning tools that can pick out anomalous behaviour in time series of… well, basically anything: CPU and memory usage, but more generally any KPI or business metric you would like to monitor.
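To give a feel for what "unsupervised" means here, below is a minimal sketch of one of the simplest possible detectors: a rolling z-score that flags a point when it lies far from the mean of the preceding window. This is only an illustration and not Metafor's method; the function name `rolling_zscore_anomalies`, the window size and the threshold are all assumptions I picked for the example.

```python
from collections import deque
from statistics import mean, stdev


def rolling_zscore_anomalies(values, window=20, threshold=3.0):
    """Return the indices of points that lie more than `threshold`
    standard deviations from the mean of the previous `window` points.

    Hypothetical helper for illustration; no labels are needed, which
    is what makes the approach unsupervised.
    """
    history = deque(maxlen=window)  # sliding window of recent values
    anomalies = []
    for i, x in enumerate(values):
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            # Skip perfectly flat windows to avoid dividing by zero.
            if sigma > 0 and abs(x - mu) / sigma > threshold:
                anomalies.append(i)
        history.append(x)
    return anomalies


# Made-up CPU series: a small repeating pattern with one spike.
cpu = [50 + (i % 5) for i in range(40)]
cpu[30] = 95
print(rolling_zscore_anomalies(cpu))
```

Note that the spike itself enters the window afterwards and inflates the standard deviation for a while, which is one reason real detectors tend to use more robust statistics.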

Besides the time-series analysis, I will also help develop a cohesiveness algorithm. This tool detects servers that behave abnormally compared with the majority of the servers in a cluster. The algorithm takes as input all the data per server (such as CPU and memory) and then automatically detects which server is out of whack relative to the overall performance of the cluster. So pretty soon I’ll be doing some happy machine learning!
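As a rough illustration of the idea (and not the actual cohesiveness algorithm, which I have yet to work on), one could compare each server against the cluster median for every metric and flag servers with a large modified z-score based on the median absolute deviation (MAD). The function name `outlier_servers`, the input layout, the server names and the threshold of 3.5 are assumptions made for this sketch.

```python
from statistics import median


def outlier_servers(cluster, threshold=3.5):
    """Flag servers that deviate strongly from the rest of the cluster.

    `cluster` maps a server name to a dict of metric name -> value,
    e.g. {"web-1": {"cpu": 41.0, "mem": 62.0}, ...} (assumed layout).
    A server is flagged if any metric has a modified z-score above
    `threshold`.
    """
    flagged = set()
    metric_names = {m for stats in cluster.values() for m in stats}
    for metric in metric_names:
        values = [s[metric] for s in cluster.values() if metric in s]
        med = median(values)
        mad = median([abs(v - med) for v in values])
        if mad == 0:
            continue  # no spread in this metric, nothing to compare against
        for name, stats in cluster.items():
            if metric not in stats:
                continue
            # 0.6745 rescales the MAD so the score is comparable to a z-score.
            score = 0.6745 * abs(stats[metric] - med) / mad
            if score > threshold:
                flagged.add(name)
    return sorted(flagged)


# Made-up cluster: web-4 uses far more CPU than its peers.
cluster = {
    "web-1": {"cpu": 41.0, "mem": 62.0},
    "web-2": {"cpu": 39.5, "mem": 60.0},
    "web-3": {"cpu": 40.2, "mem": 61.0},
    "web-4": {"cpu": 88.0, "mem": 63.0},
    "web-5": {"cpu": 40.8, "mem": 59.0},
}
print(outlier_servers(cluster))
```

The median and MAD are used instead of the mean and standard deviation so that the misbehaving server does not drag the baseline toward itself.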

Resources for learning about databases

A data scientist spends a lot of time dealing with data and getting it in and out of databases. There are some good resources for learning SQL and MapReduce. On Coursera there is the course Introduction to Databases, which mainly deals with relational databases and SQL (queries, normal forms, constraints & triggers, views, etc.). Coursera also offers a Data Science course where, among many other interesting topics, they spend some time discussing MapReduce. If you want to practice with the ideas behind MapReduce, there is JSMapReduce, where you can play around with Python or JavaScript code.
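If you have never seen MapReduce, the classic word count shows the three phases (map, shuffle, reduce) in a few lines. The sketch below runs in a single Python process and is meant purely as a mental model; the function names are my own and do not correspond to JSMapReduce or to any particular framework.

```python
from collections import defaultdict


def map_phase(documents, mapper):
    """Apply the mapper to every document, yielding (key, value) pairs."""
    for doc in documents:
        yield from mapper(doc)


def shuffle(pairs):
    """Group all values by key, as a framework would do between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(doc):
    # Emit (word, 1) for every word in the document.
    for word in doc.lower().split():
        yield word, 1


def count_reducer(word, counts):
    # Sum the ones emitted for this word.
    return sum(counts)


docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(docs, word_mapper)), count_reducer)
print(counts)
```

In a real cluster the mappers and reducers run on different machines and the shuffle moves data over the network, but the programmer still only writes the two small functions.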

Home-made queuing simulation software

Visualization of effects of prioritizing customers/jobs. Queue 1 gets priority over both Queue 2 & 3 while Queue 2 gets priority over Queue 3

Near the end of my PhD I started volunteering for RESAAS. They were interested in modelling the traffic on their website. So I developed some simulation software in C#. They were kind enough to let me publish parts of this program.

The main part of the software is a simulation program that compares a queueing system where customers are served on a First-Come-First-Served basis with a queueing system where some customers get priority over others. A detailed explanation of the program can be found here. The C# code is available on GitHub.
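To make the comparison concrete, here is a small Python sketch of the same idea. It is not a port of the C# program: it assumes a single server, non-preemptive service (a job in progress is never interrupted) and a fixed list of jobs, and the function name `simulate` and the job layout are made up for this example.

```python
import heapq


def simulate(jobs, use_priority):
    """Return the mean waiting time per priority class.

    `jobs` is a list of (arrival_time, service_time, priority_class)
    tuples, where a lower class number means higher priority. With
    use_priority=False, jobs are served First-Come-First-Served.
    """
    jobs = sorted(jobs)  # process arrivals in time order
    waiting = []         # heap of jobs that have arrived but not started
    waits = {}           # priority class -> list of waiting times
    clock = 0.0
    i = 0
    while i < len(jobs) or waiting:
        if not waiting and jobs[i][0] > clock:
            clock = jobs[i][0]  # server is idle until the next arrival
        # Admit every job that has arrived by now.
        while i < len(jobs) and jobs[i][0] <= clock:
            arrival, service, cls = jobs[i]
            key = (cls, arrival) if use_priority else (arrival,)
            heapq.heappush(waiting, (key, arrival, service, cls))
            i += 1
        # Serve the next job according to the queueing discipline.
        _, arrival, service, cls = heapq.heappop(waiting)
        waits.setdefault(cls, []).append(clock - arrival)
        clock += service
    return {cls: sum(w) / len(w) for cls, w in sorted(waits.items())}


# Two low-priority jobs (class 3) and one high-priority job (class 1).
jobs = [(0.0, 4.0, 3), (1.0, 2.0, 3), (2.0, 1.0, 1)]
print("FCFS:    ", simulate(jobs, use_priority=False))
print("Priority:", simulate(jobs, use_priority=True))
```

In this tiny example the class 1 job waits 4 time units under First-Come-First-Served and 2 under the priority rule, while the average wait of class 3 rises from 1.5 to 2: priority does not create capacity, it only shifts waiting time between classes.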

Twitter API, sentiment, text mining and visualization

Example of a (relative) sentiment plot based on Twitter data

I have been taking an online course on Data Science at Coursera. One of the assignments concerned mining Twitter texts. We had to extract live Twitter data and use text mining techniques to compute sentiment scores. Sentiment analysis applies natural language processing, computational linguistics, and text analytics to determine whether a text is positive, negative, or neutral.

I thought it would be neat to extract the sentiment scores for each US state and then plot the relative sentiment scores. So I wrote some Python code to extract live tweets and compute the average sentiment score per state. I fed this data into R to make the visualization above. (This is all combined into a single shell script.)
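The scoring step can be sketched roughly as follows. This is a simplified illustration rather than the code from the repository: the tiny `SENTIMENT` word list is made up, and I assume each tweet has already been reduced to a dict with a `"text"` key and, where the location could be resolved, a `"state"` key. Fetching tweets from the Twitter API is left out entirely.

```python
from collections import defaultdict

# Hypothetical, tiny word list; a real lexicon has thousands of entries.
SENTIMENT = {"good": 3, "great": 3, "happy": 3,
             "bad": -3, "awful": -3, "sad": -2}


def tweet_score(text, lexicon=SENTIMENT):
    """Sum the scores of the known words in a tweet (unknown words count 0)."""
    words = (w.strip(".,!?#@").lower() for w in text.split())
    return sum(lexicon.get(w, 0) for w in words)


def average_by_state(tweets, lexicon=SENTIMENT):
    """Average the tweet scores per state, skipping tweets without a state."""
    totals = defaultdict(lambda: [0, 0])  # state -> [sum of scores, count]
    for tweet in tweets:
        state = tweet.get("state")
        if state is None:
            continue  # location could not be resolved
        totals[state][0] += tweet_score(tweet["text"], lexicon)
        totals[state][1] += 1
    return {state: total / n for state, (total, n) in totals.items()}


# Made-up sample data.
tweets = [
    {"state": "WA", "text": "Great coffee, happy day!"},
    {"state": "WA", "text": "Traffic is awful"},
    {"state": "NY", "text": "Just landed"},
    {"text": "good morning"},  # no state, ignored
]
print(average_by_state(tweets))
```

A state whose tweets contain no words from the lexicon ends up with an average of exactly 0, which is worth keeping in mind when reading the plot.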

First of all, a word of warning. This is more of an exercise in data visualization and in using the Twitter API. The meaning of a sentiment score is up for debate (see this nice post) and I have not yet done a thorough analysis of the scores. In particular, many states get a score of 0 simply because the data is not very clean. The code for this project is available on GitHub.