Resources for learning about databases

A data scientist spends a lot of time handling data and moving it in and out of databases. There are some good resources for learning SQL and MapReduce. On Coursera, the course Introduction to Databases mainly covers relational databases and SQL (queries, normal forms, constraints and triggers, views, etc.). Coursera also offers a Data Science course which, among many other interesting topics, spends some time on MapReduce. If you want to practice with the ideas behind MapReduce, there is JSMapReduce, where you can play around with Python or JavaScript code.
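To give a feel for the MapReduce idea, here is a minimal single-machine sketch in plain Python that counts words. The function names (map_phase, shuffle_phase, reduce_phase, word_count) are made up for this illustration and are not taken from JSMapReduce or from either course; a real framework would distribute these steps over many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence on a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    groups = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


if __name__ == "__main__":
    print(word_count(["the quick brown fox", "the lazy dog", "The fox"]))
```

The point of the split is that the map and reduce steps only ever look at one document or one key at a time, which is what makes them easy to run in parallel.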

Home-made queuing simulation software

Visualization of effects of prioritizing customers/jobs. Queue 1 gets priority over both Queue 2 & 3 while Queue 2 gets priority over Queue 3


Near the end of my PhD I started volunteering for RESAAS. They were interested in modelling the traffic on their website, so I developed some simulation software in C#. They were kind enough to let me publish parts of this program.

The main part of the software is a simulation program that compares a queueing system where customers are served on a First-Come-First-Served basis with a queueing system where some customers get priority over others. A detailed explanation of the program can be found here. The C# code is available on GitHub.
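The comparison can be sketched in a few lines of Python. This is not the published C# program, only a rough illustration under simplifying assumptions: a single server, no preemption (a job in service is never interrupted), and a lower queue number meaning higher priority. The names simulate and make_jobs, and the arrival and service rates, are invented for this sketch.

```python
import heapq
import random


def simulate(jobs, use_priority):
    """Single-server, non-preemptive queue.

    jobs: list of (arrival_time, service_time, queue_id), sorted by arrival.
    Returns the mean waiting time per queue_id.
    """
    waiting = []  # heap of (sort_key, index, arrival, service, queue_id)
    waits = {}
    clock = 0.0
    i = 0
    while i < len(jobs) or waiting:
        # If the server is idle and nobody is waiting, jump to the next arrival.
        if not waiting and jobs[i][0] > clock:
            clock = jobs[i][0]
        # Admit every job that has arrived by the current time.
        while i < len(jobs) and jobs[i][0] <= clock:
            arrival, service, queue_id = jobs[i]
            # Priority: lowest queue_id first, then arrival order. FCFS: arrival order only.
            key = (queue_id, arrival) if use_priority else (arrival,)
            heapq.heappush(waiting, (key, i, arrival, service, queue_id))
            i += 1
        # Serve the next job to completion.
        _, _, arrival, service, queue_id = heapq.heappop(waiting)
        waits.setdefault(queue_id, []).append(clock - arrival)
        clock += service
    return {q: sum(w) / len(w) for q, w in sorted(waits.items())}


def make_jobs(n, seed=0):
    """Random jobs: exponential interarrival and service times, three queues."""
    rng = random.Random(seed)
    t, jobs = 0.0, []
    for _ in range(n):
        t += rng.expovariate(1.0)          # mean interarrival time 1.0 (assumed)
        service = rng.expovariate(1.25)    # mean service time 0.8 (assumed)
        jobs.append((t, service, rng.choice([1, 2, 3])))
    return jobs


if __name__ == "__main__":
    jobs = make_jobs(5000)
    print("FCFS:    ", simulate(jobs, use_priority=False))
    print("Priority:", simulate(jobs, use_priority=True))
```

Running both disciplines on the same list of jobs keeps the comparison fair: any difference in mean waiting time per queue comes from the service order alone.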

Twitter API, sentiment, text mining and visualization


Example of a (relative) sentiment plot based on Twitter data

I have been taking an online course on Data Science at Coursera. One of the assignments concerned the mining of Twitter texts. We had to extract live Twitter data and use text mining techniques to compute sentiment scores. Sentiment analysis applies natural language processing, computational linguistics, and text analytics to determine whether a text is positive, negative, or neutral.

I thought it would be neat to extract the sentiment scores for each US state and then plot the relative sentiment scores. Hence, I wrote some Python code to extract live tweets and compute the average sentiment score per state. I fed this data into R to make the visualization above. (This is all combined into a single shell script.)
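The per-state averaging step might look roughly like the sketch below. It is not the code from the GitHub project: the tiny SENTIMENT word list, the (state, text) tweet format, and the function names are all assumptions made for illustration, and the Twitter API call itself is left out.

```python
from collections import defaultdict

# Hypothetical toy lexicon; a real analysis would use a much larger word list.
SENTIMENT = {"good": 2, "great": 3, "happy": 3, "bad": -2, "awful": -3, "sad": -2}


def tweet_score(text):
    """Sum the lexicon scores of the words in a tweet; unknown words count as 0."""
    words = text.lower().split()
    return sum(SENTIMENT.get(word.strip(".,!?#"), 0) for word in words)


def average_by_state(tweets):
    """tweets: iterable of (state, text) pairs. Returns the mean score per state."""
    totals = defaultdict(lambda: [0, 0])  # state -> [score sum, tweet count]
    for state, text in tweets:
        entry = totals[state]
        entry[0] += tweet_score(text)
        entry[1] += 1
    return {state: total / count for state, (total, count) in totals.items()}


if __name__ == "__main__":
    tweets = [
        ("CA", "What a great day!"),
        ("CA", "Traffic is awful"),
        ("NY", "So happy, so good"),
        ("TX", "Just landed"),
    ]
    print(average_by_state(tweets))
```

Note that a tweet containing no words from the lexicon scores 0, so a state with few or noisy tweets can easily end up with an average of 0.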

First of all, a word of warning. This is more of an exercise in data visualization and in using Twitter APIs. The meaning of a sentiment score is up for debate (see this nice post) and I have not yet done a thorough analysis of the scores. In particular, many states get a score of 0 simply because the data are not very clean. The code for this project is available on GitHub.