Last week I wrote a blog post summarising the first four weeks of the internship, but it was never used, so I am using it as this week’s post.
First four weeks summary:
My development work can be divided into two categories: data analysis with Python scripts and data visualisation with Google Charts.
Data analysis scripts
At the beginning of the second week we were provided with a set of CSV files containing the latest data available at the time for the Developing Your Research Project MOOC. Based on the analysis tasks I was given, I started work on Python scripts that filter the raw data and produce basic visualisations. To help me figure out the data manipulation and filtering process, I first implemented it in LibreOffice Calc and then recreated it in code. I came to realise that the analysis mostly required pivoting the data in some way, so I researched the best tools for doing that in Python. In the end I decided to use the pandas library, as it seemed to be the standard across the data science community and brings functionality similar to R's data frames to Python. The easiest way of installing it was through the Anaconda Python distribution, which comes with a set of essential libraries for data analysis including matplotlib, which I used for simple visualisations.
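To illustrate the kind of pivoting described above, here is a minimal sketch using pandas' `pivot_table`. The column names (`learner_id`, `week`, `step`) and the sample rows are made up for the example and are not taken from the real CSV files.

```python
import pandas as pd

# Hypothetical step-activity rows; the real CSV columns may differ.
activity = pd.DataFrame({
    "learner_id": ["a", "a", "b", "b", "c"],
    "week": [1, 1, 1, 2, 2],
    "step": ["1.1", "1.2", "1.1", "2.1", "2.1"],
})

# Pivot: one row per week, one column per step,
# each cell counts the distinct learners seen for that pair.
pivot = activity.pivot_table(
    index="week",
    columns="step",
    values="learner_id",
    aggfunc="nunique",
    fill_value=0,
)
print(pivot)
```

This is roughly what a pivot table in a spreadsheet does: pick the row field, the column field, and how to aggregate the values.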
The following is a list of the scripts I have developed paired with a short description of their functionality:
day_analysis.py – for each day of the course, plots the number of unique learners who have visited a step for the first time
time_analysis.py – same as for the day analysis but plots the data by hour of the day
reply_analysis.py – for each step of the course, plots the percentage of replies to comments
enrollment.py – for each day of the course, plots the total number of enrolled students
All of these scripts can be found on our GitHub repo at https://github.com/Lubo-93/MOOC-Viz-Scripts (note that they will not work without the data sets, which are not provided in the repo).
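As a rough idea of what the day and hour analyses involve, the sketch below counts distinct learners per day and per hour of the day. It is not the code from the repo: the column names (`learner_id`, `first_visited_at`) and the timestamps are assumptions made for the example, and it assumes that each row records a learner's first visit to one step.

```python
import pandas as pd

# Hypothetical data: one row per (learner, step) first visit.
visits = pd.DataFrame({
    "learner_id": ["a", "a", "b", "c", "c"],
    "first_visited_at": pd.to_datetime([
        "2014-09-01 09:15",
        "2014-09-02 10:00",
        "2014-09-01 20:30",
        "2014-09-02 08:05",
        "2014-09-03 23:59",
    ]),
})

timestamps = visits["first_visited_at"]

# Distinct learners with at least one first visit on each calendar day.
per_day = visits.groupby(timestamps.dt.date)["learner_id"].nunique()

# Same idea, grouped by hour of the day (0-23) instead of by date.
per_hour = visits.groupby(timestamps.dt.hour)["learner_id"].nunique()

print(per_day)
print(per_hour)
```

Either series could then be passed to matplotlib (for example via `per_day.plot()`) to get a simple chart.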
As long as the data format of the original CSV files doesn’t change, these scripts will be able to filter and visualise new data as it is supplied. Since most of the CSV files are similar in structure and producing the visualisations requires pivoting, few changes to the code are needed to adapt the scripts to different analysis scenarios. However, in future work the scripts could be generalised into a single script that manipulates the data depending on user-supplied parameter values. This would be beneficial for the final dashboard as well. Additionally, all the scripts can export their pivots to JSON, but further work is needed on correct formatting.
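One possible shape for such a generalised script is sketched below: a single function that takes the row field, column field and value field as parameters, plus a helper that serialises the pivot to JSON. The function names `pivot_counts` and `pivot_to_json` and the sample columns are hypothetical, and the JSON layout is only one of several that pandas' `to_json` can produce.

```python
import json

import pandas as pd


def pivot_counts(frame, index, columns, values):
    """Count distinct `values` for every (index, columns) pair."""
    return frame.pivot_table(
        index=index,
        columns=columns,
        values=values,
        aggfunc="nunique",
        fill_value=0,
    )


def pivot_to_json(pivot):
    """Serialise a pivot as {row label: {column label: count}}."""
    return pivot.to_json(orient="index")


# Hypothetical sample data; the real CSV columns may differ.
comments = pd.DataFrame({
    "author_id": ["a", "b", "b", "c"],
    "week": [1, 1, 2, 2],
    "step": ["1.1", "1.1", "2.1", "2.1"],
})

pivot = pivot_counts(comments, index="week", columns="step", values="author_id")
exported = json.loads(pivot_to_json(pivot))
print(exported)
```

With `orient="index"` the row labels become the top-level JSON keys (as strings), which may or may not match what a charting library expects.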
Data visualisation
As far as data visualisation goes, I decided to use Google Charts because of its simple API and its dashboard feature, which allows you to bind multiple charts and filter controls so that they all respond together when the data or a filter is modified. I learned how to develop with Google Charts during WAIS Fest, for which I made a dashboard with different chart types for UK road traffic data. Although it was not completely finished, the experience taught me how to work with Google Charts and made me aware of some of its limitations. For example, it doesn’t let you specify the regions displayed on its geo maps (e.g. the UK map only shows England, Wales, Scotland and Northern Ireland; you can’t include any more specific administrative regions). However, I discovered a workaround using Highmaps: it allows you to specify regions with GeoJSON-encoded strings, or you can make completely custom maps (I successfully tried both, although using GeoJSON proved to be really slow). With the skills I gathered from WAIS Fest I developed a dashboard that visualises course activity and enrolment data with several filtering options.
This week I continued with the jobs I had left unfinished from last week. I changed the table structure and tried other ways of importing CSV files into MySQL. Currently, it seems to work well and takes less time. After a discussion with Lubo and Manuel, I decided to use this version for the time being.
Besides import efficiency, fetching data quickly is another factor I need to consider. MySQL allows us to set up an index to speed up searching, but inserting data then takes more time because MySQL has to update the index for every new row. So there is a trade-off we need to decide on.
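The trade-off can be tried out on a small scale. The sketch below uses SQLite from Python's standard library as a stand-in for MySQL, purely so that it runs without a database server. The table and column names are invented for the example. It bulk-loads the rows first and creates the index afterwards, which is one common way to avoid paying the index-maintenance cost on every insert.

```python
import sqlite3

# In-memory SQLite database, used here only as a stand-in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments (id INTEGER PRIMARY KEY, step TEXT, author_id TEXT)"
)

# Bulk-load hypothetical rows before any secondary index exists.
rows = [(i, f"1.{i % 10}", f"learner{i % 50}") for i in range(1000)]
conn.executemany("INSERT INTO comments VALUES (?, ?, ?)", rows)

# Create the index after loading; lookups by step can now use it.
conn.execute("CREATE INDEX idx_comments_step ON comments (step)")
conn.commit()

# Ask SQLite how it would run the lookup, to check the index is used.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT author_id FROM comments WHERE step = ?",
    ("1.3",),
).fetchall()
plan_text = " ".join(str(row[-1]) for row in plan)

count = conn.execute(
    "SELECT COUNT(*) FROM comments WHERE step = ?", ("1.3",)
).fetchone()[0]

print(plan_text)
print(count)
```

MySQL has its own `CREATE INDEX` and `EXPLAIN` statements, so the same experiment should carry over, though the output format and the actual timings will differ.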
After dealing with MySQL, we started to learn a new programming language for data analysis: R. R is easy to learn and use. Compared with Python, it takes less time to process the same data. I studied all the chapters in the R online tutorial, so I am now familiar with the syntax and have learned about some useful and interesting features of R. I also tried converting my Python scripts into R and compared the two, and I think R works better. During the following week, I will continue my research on R.