
All posts by Lyubomir Vasilev

MOOC Viz Interns Week 5 Update

Lubo:

This week I found a new tool for building the dashboard – Shiny.

Shiny is a web framework for R that lets users easily create good-looking analytics dashboards directly in R, without the need for any HTML or backend programming. A great benefit of Shiny, however, is that it is built around HTML, and if you desire you can directly add HTML elements or apply a CSS theme. Additionally, the framework offers a good range of filtering controls and supports dynamic content. As far as visualisations go, Shiny works with some of the most popular JavaScript charting libraries, including Highcharts, D3, dygraphs and more.

Given all of its features and the fact that it can easily be understood and shared amongst the R community, we have decided to use Shiny for the final dashboard.

Most of the work I’ve done this week consisted of researching and learning Shiny. Apart from that, I received access to the course survey data and was given the task of making different visualisations by filtering learners based on their survey responses. To accomplish this, I produced two scripts – one that filters learners based on their survey responses, and another that runs different analyses on the learner ids returned by the first script’s functions (a rough sketch of this split follows below).
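A minimal sketch of that two-script split, assuming hypothetical file and column names (learner_id and the survey question column are placeholders – the real data differs):

```python
import pandas as pd

# Sketch of the survey-filter script -- names are hypothetical.
def learners_by_response(survey_csv, question, answer):
    """Return the ids of learners who gave `answer` to `question`."""
    survey = pd.read_csv(survey_csv)
    return survey.loc[survey[question] == answer, "learner_id"].unique()

# Sketch of the analysis script, consuming the ids returned above.
def step_activity_for(activity_csv, learner_ids):
    """Count unique filtered learners per course step."""
    activity = pd.read_csv(activity_csv)
    subset = activity[activity["learner_id"].isin(learner_ids)]
    return subset.groupby("step")["learner_id"].nunique()

ids = learners_by_response("survey.csv", "employment_status", "student")
print(step_activity_for("step_activity.csv", ids))
```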

The next step is to use the scripts I have made to build an initial dashboard using Shiny.

Lin:

This week I have completed two tasks: importing the MySQL data to a remote server and fetching course data from FutureLearn.

The first one wasn’t very difficult because I already had a script that imports the data into localhost; I just needed to change its arguments so that it imports to a remote server. At first the script didn’t work, however, because I had misunderstood the setup: the remote MySQL server only accepts local connections, so to reach it I have to create an SSH tunnel and bind it to localhost. Fortunately, there is an external module – sshtunnel – which makes binding remote servers easy, and so far it works without error.
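Concretely, the pattern looks like this sketch (hosts, credentials and table names are all hypothetical):

```python
import pymysql
from sshtunnel import SSHTunnelForwarder

# Forward the remote MySQL port to a local one over SSH.
with SSHTunnelForwarder(
    ("remote.example.ac.uk", 22),           # hypothetical host
    ssh_username="lin",
    ssh_pkey="~/.ssh/id_rsa",
    remote_bind_address=("127.0.0.1", 3306),
) as tunnel:
    conn = pymysql.connect(
        host="127.0.0.1",
        port=tunnel.local_bind_port,        # the locally bound port
        user="mooc",
        password="...",
        database="mooc_data",
    )
    with conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM enrolments")  # hypothetical table
        print(cur.fetchone())
    conn.close()
```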

The second task was harder for me because of my lack of experience. The goal is a script that automatically downloads the course data from FutureLearn and uploads it to the MOOC Observatory at regular intervals. To accomplish this I had to write HTTP requests in Python, and since I had never worked with HTTP before, it took me a few days to build up some basic knowledge. Currently I am waiting for an admin account, because I need to analyse the admin webpage. Additionally, I need to decide on a suitable update interval, depending on the load this puts on the web server.
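The skeleton will probably look something like this sketch using the requests library – every URL and form field below is a placeholder, since the real admin pages can only be inspected once the account arrives:

```python
import requests

BASE = "https://www.futurelearn.com"

session = requests.Session()
# Hypothetical login endpoint and form fields.
session.post(BASE + "/sign-in", data={
    "email": "admin@example.org",
    "password": "...",
})

# Hypothetical data export URL on the admin pages.
resp = session.get(BASE + "/admin/courses/example-course/enrolments.csv")
resp.raise_for_status()

with open("enrolments.csv", "wb") as f:
    f.write(resp.content)
```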

I think our current progress is good and I believe we will be able to finish our project on time. Hopefully nothing will go wrong in the near future, and I will try my best on this project in the following weeks.

MOOC Visualisation Interns Week 4 Update

Lubo:

Last week I wrote a summary blog post covering the first four weeks of the internship, but it never got used, so I am going to use it for this week’s post.

First four weeks summary:

My development work can be divided into two categories: data analysis with Python scripts and data visualisation with Google Charts.

Data analysis scripts

At the beginning of the second week we were provided with a set of csv files containing the latest data (at the time) from the Developing Your Research Project MOOC. Based on the analysis tasks I was given, I started work on Python scripts that filter the raw data and produce basic visualisations. To help me figure out the data manipulation and filtering process, I first implemented it in Libre Calc and then tried to recreate it in code. I came to realise that the analysis mostly required pivoting the data in some way, so I researched the best tools for doing that in Python. In the end I decided to use the pandas library, as it seems to be the standard across the data science community and provides R-like functionality in Python. The easiest way of installing it was through the Anaconda Python distribution, which comes with a set of essential libraries for data analysis, including matplotlib, which I used for the simple visualisations.
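To give a flavour of the pivoting involved, here is a minimal sketch (file and column names are hypothetical – the real csv layout differs) of the kind of transformation behind the day analysis script listed below:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical columns: one row per learner per step first visit.
activity = pd.read_csv("step_activity.csv",
                       parse_dates=["first_visited_at"])
activity["day"] = activity["first_visited_at"].dt.date

# Pivot: one row per day, one column per step,
# counting unique learners in each cell.
pivot = activity.pivot_table(index="day", columns="step",
                             values="learner_id",
                             aggfunc=pd.Series.nunique)

pivot.plot(figsize=(10, 4))   # pandas plotting uses matplotlib
plt.ylabel("Unique learners")
plt.show()
```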

The following is a list of the scripts I have developed paired with a short description of their functionality:

day_analysis.py – for each day of the course, plots the number of unique learners who have visited a step for the first time
time_analysis.py – same as for the day analysis but plots the data by hour of the day
reply_analysis.py – for each step of the course, plots the percentage of replies to comments
enrollment.py – for each day of the course, plots the total number of enrolled students

All of these scripts can be found on our GitHub repo at https://github.com/Lubo-93/MOOC-Viz-Scripts (note that they will not work without the data sets, which are not provided in the repo).

As long as the data format of the original csv files doesn’t change, these scripts will be able to filter and visualise new data as it is supplied. Since most of the csv files are similar in structure and producing the visualisations requires pivoting, not many changes to the code are needed to adapt the scripts to different analysis scenarios. However, in future work the scripts could be generalised into a single script that manipulates the data depending on user-supplied parameter values. This would be beneficial for the final dashboard as well. Additionally, all the scripts can export their pivots to JSON, but further work is needed on correct formatting.
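The export itself is a one-liner in pandas; the open formatting question is which layout the dashboard will want, which pandas exposes through the orient parameter (continuing from the hypothetical pivot sketched above):

```python
# Same data, different JSON layouts -- picking one is the open question.
pivot.to_json("day_activity.json", orient="split")    # index/columns/data arrays
# pivot.to_json("day_activity.json", orient="records")  # list of row objects
# pivot.to_json("day_activity.json", orient="index")    # nested day -> step map
```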

Data visualisation

As far as data visualisation goes, I decided to use Google Charts because of its simple API and its dashboard feature, which allows you to bind multiple charts and filter controls so that they all respond together when the data or a filter is modified. I learned how to develop with Google Charts during WAIS Fest, for which I made a dashboard with different chart types for UK road traffic data. Although it was not completely finished, the experience taught me how to work with Google Charts, and I also became aware of some of its limitations. For example, it doesn’t let you specify the regions displayed on its geo maps (e.g. the UK map only shows England, Wales, Scotland and Northern Ireland; you can’t display more specific administrative regions). However, I discovered a workaround using Highmaps – it allows you to specify regions with GeoJSON-encoded strings, or you can make completely custom maps (I successfully tried both, although using GeoJSON proved to be really slow). With the skills I gathered at WAIS Fest I developed a dashboard that visualises course activity and enrolment data with multiple filtering options.

Lin:

This week I continued with the jobs I had left unfinished from last week. I changed the table structure and used other ways to import csv files into MySQL. Currently it seems to work well and takes less time. After a discussion with Lubo and Manuel, I decided to use this version for the time being.

Besides import efficiency, fetching data quickly is another factor I need to consider. MySQL allows us to set up an index to accelerate searching, but inserting data then takes more time because MySQL has to update the index for each new row. So there is a balance we need to strike.
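One common way to strike that balance – sketched here with a hypothetical table and column – is to bulk-load first and only build the index afterwards, so individual inserts don’t pay the index-maintenance cost:

```python
import pymysql

conn = pymysql.connect(host="127.0.0.1", user="mooc", password="...",
                       database="mooc_data", local_infile=True)
with conn.cursor() as cur:
    # 1. Bulk-load the csv with no secondary index in place ...
    cur.execute("LOAD DATA LOCAL INFILE 'comments.csv' INTO TABLE comments "
                "FIELDS TERMINATED BY ',' IGNORE 1 LINES")
    # 2. ... then pay the indexing cost once, after the load.
    cur.execute("ALTER TABLE comments ADD INDEX idx_learner (learner_id)")
conn.commit()
conn.close()
```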

After dealing with MySQL, we started to learn a new programming language for data analysis – R. R is easy to learn and use; compared with Python, it took less time to work through the same data. I studied all the chapters of an online R tutorial, and now I am familiar with the syntax and have learned about some quite useful and interesting features of R. I also tried converting my Python scripts into R and compared the two – I think R works better for this. During the following week I will continue my research on R.

MOOC Visualisation Interns Week 3 Update

Lubo:

At the beginning of the week I managed to finalise all of the Python scripts I had been working on. At this moment I have successfully produced scripts that filter and visualise the following data:

  • Step activity by day
  • Step activity by time
  • Number of comment replies per step

Additionally, I can export the script-generated tables to JSON, although some further work on the desired formatting is needed.

From Wednesday onwards we focused on WAIS Fest. We joined Chris’s team and worked on developing a learner course on Data Science in the form of an iBook. The experience was both enjoyable and valuable, as I got to use Google Charts for the first time and can now successfully visualise data from JSON files, which will probably be needed for the main project.

Next week I will use the knowledge I’ve gathered to develop an animated visualisation of weekly activity on Google Charts.

Lin:

This week we focused mostly on WAIS Fest. Our group’s topic was Data Science, and we developed an iBook that helps people without any relevant background learn about it. We assigned different tasks to each group member; Lubo and I did the visualisation part, using a UK traffic dataset that Lubo found online. This time we used a new visualisation tool – Google Charts, with JavaScript. Learning new tools like this is very helpful for our project, and I gained a lot of experience and knowledge from the activity.

In addition, I tried different ways to improve the efficiency of importing csv files into MySQL. However, this is still a bit difficult for me because of my lack of experience with MySQL. Next week I will read more of the MySQL documentation and look for a better method.

Next week we will continue our MOOC project. We plan to start an initial report; it may just be a simple draft covering some background of the project. More details will be discussed next week.

MOOC Visualisation Interns Week 2 Update

Lubo:

This week was a bit different in terms of the type of work we had. On Monday we were introduced to FutureLearn’s data sets and given a sample to do some initial analysis on, in order to get accustomed to using the data and thinking about it.

For the rest of the week, my work primarily involved producing charts (based on Manu’s analysis questions) in Libre Office and then attempting to reproduce them in Python.

I successfully developed the charts in Libre, although I ran into several technical difficulties due to the size of the data sets and the performance of the lab machine. I’ve had reasonable success reproducing them in Python, but the process was slow and I encountered several issues, arising mostly from my lack of experience with Python and particularly with the pandas library, which I have decided to use for the data manipulation.

Apart from that, I read an interesting paper on unsupervised dialogue act modelling which could be potentially quite useful if we decide to classify comments. I have added the paper to the MOOC Observatory Mendeley group.

Next week I will wrap up the Python scripts I’ve been working on and start on the task of developing an animated visualisation of step activity over the weeks.

Lin:

Last week we read many papers to understand the fundamental concepts of MOOC visualisation and found some useful information to help our project. Besides reading papers, this week we played with the data provided by Manu, which was quite an interesting job. We visualised this FutureLearn data with Python scripts and external visualisation tools. I used plotly, an online analytics and data visualisation tool that is easy to use.

The graph shown below is a sample visualised with plotly; it illustrates the reply percentage for each step (first two weeks). We can see that the highest percentage is in step 2.5, where participants need to describe their projects and can also review and assess other learners’ projects – that is why it gets more replies than the rest of the steps. Steps that do not have a discussion forum were removed from the graph, which is why some steps, such as steps 2.7 to 2.9, do not appear.

[Figure: bar chart of reply percentage per step, weeks 1–2]
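For reference, producing that kind of chart looks roughly like the sketch below. The column names are hypothetical, and plotly’s API has changed since this was written, so this uses today’s plotly.express rather than the old online workflow:

```python
import pandas as pd
import plotly.express as px

# Hypothetical columns: one row per comment, parent_id set on replies.
comments = pd.read_csv("comments.csv")
comments["is_reply"] = comments["parent_id"].notna()

# Percentage of comments in each step that are replies.
reply_pct = (comments.groupby("step")["is_reply"].mean() * 100).reset_index()
fig = px.bar(reply_pct, x="step", y="is_reply",
             labels={"is_reply": "Replies (%)", "step": "Step"})
fig.show()
```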

While visualising the data I encountered some problems as well. Choosing an appropriate chart is sometimes difficult for me, especially when I have more data types to present; it is a problem I need to solve in the next few weeks. The other problem is efficiency: the data we have now is quite small compared to big data, yet some scripts take 2–3 seconds to analyse all the data and generate a graph. This will cause serious problems once I read bigger data in the future. I tried using different collections to store and search the data; the current version is faster but still takes about a second. I am not sure whether that is efficient enough, but we must pay more attention to this problem as we develop our project.
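For the look-up side of that problem, one change that usually helps – a generic sketch with made-up placeholder rows, not our actual script – is replacing repeated list scans with a dictionary keyed on the field being searched:

```python
from collections import defaultdict

rows = [                               # stand-in for rows parsed from a csv
    {"learner_id": "a", "step": "1.1"},
    {"learner_id": "b", "step": "1.2"},
    {"learner_id": "a", "step": "2.5"},
]

# Slow: scan the whole list on every lookup -- O(n) each time.
def find_slow(learner_id):
    return [r for r in rows if r["learner_id"] == learner_id]

# Faster: build a dictionary once; each lookup afterwards is O(1).
index = defaultdict(list)
for r in rows:
    index[r["learner_id"]].append(r)

assert find_slow("a") == index["a"]
```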

MOOC Visualisation Interns Week 1 Update

This was our first week as interns working on the FutureLearn MOOC data visualisation project. During this time we became acquainted with the general goals of the research and met some of the people involved with it. However, the specific project requirements will be discussed with the other researchers over the course of the following weeks.

Most of our work for this initial week consisted of reading a shared set of papers related to MOOC data mining, analysis and organisation. The remainder of this post contains our individual accounts of the research we have done.

Lubo:

Most of the papers I read this week came from the initial list of recommended reading given to us by our supervisor. The following is a brief overview of the goals and findings of each of these papers:

  • MOOCdb – the initial introduction of the already established standardised database schema for raw MOOC data; the original proposal of the paper is a standardised, cross-course, cross-platform database schema which will enable analysts to easily work on the data by developing scripts; the intention is to build a growing community by developing a repository of scripts; the concept proposes the sharing of data without exchanging it
  • MOOCViz – presents the implementation of the analytics platform envisioned in the MOOCdb paper; the framework provides the means for users to contribute and exchange scripts and run visualisations
  • Learner Interactions During Online MOOC Discussions – Ayse’s paper from the WAIS group; investigates the relation between high attrition rates and low levels of participation in online discussions; provides a novel model of measuring learners’ interaction amongst themselves and offers a method of predicting possible future interactions; dividing the predictions in categories and the means of calculating friendship strength are particularly interesting
  • Monitoring MOOCs – a paper that reports the findings of a survey of 92 MOOC instructors on which information they find most useful for visualising student behaviour and performance; it provides good insight for the types of data and visualisation that would potentially be useful for our project; additionally, it is a very good reference source for papers dealing with different visualisation methods for MOOC data
  • Visualizing patterns of student engagement and performance in MOOCs – investigates high attrition rates; its main goals are to develop more refined learning analytic techniques for MOOC data and to design meaningful visualisations of the output; to do so it classifies student types by using learning analytics of interaction and assessment and visualises patterns of student engagement and success across distinct MOOCs; employs a structured analysis approach where specific variables and analyses results are determined iteratively at increasingly finer levels of granularity; utilises different visualisation diagrams that will likely be of interest for our project
  • Analyzing Learner Subpopulations in MOOCs – again, investigates attrition; previous paper took inspiration from this one for its analysis and visualisation approach; interesting method for classifying students by engagement; uses k-means clustering

The research I have conducted this week has helped me familiarise myself with MOOC data visualisation and analysis and the challenges associated with them. More broadly, it has given me an insight into educational data mining and learning analytics. However, there is still an abundance of research to be done. I have found that I lack knowledge of statistics, which prevents me from fully understanding some of the papers. In addition, there is a plethora of possible visualisation tools and methods available, so becoming familiar with them and choosing the right ones within the available project time will prove challenging.

Apart from reading papers this week, I also completed the first three weeks of the Developing Your Research Project MOOC to become acquainted with the structure of a typical MOOC on the FutureLearn platform.

Lin:

A great deal of research is trying to find suitable ways to help MOOC instructors understand and analyse the interactions and performance of students. Because of the enormous number of students enrolling in MOOCs, making use of this data is a big challenge for scientists. In the paper “MOOCdb: Developing Data Standards for MOOC Data Science”, the authors propose MOOCdb to manage the data. MOOCdb adopts various strategies to help people use the data more efficiently; for example, it standardises the data so that data from different sources ends up in the same format. In addition, deciding what information is important and how it helps instructors analyse student interactions is a problem as well. Some researchers propose that students’ interaction with courses should be measured by their grades and the time they spend; other researchers recognise that different interaction patterns also affect students’ performance. In Ayse’s paper, she proposes a strength value that can be computed to predict the friendship between two students, which is quite an interesting idea. Although I have seen various ideas so far, they do not yet seem sufficient for our project. Next week I plan to read more papers and do more research in this field.