MOOC Observatory thoughts

We have had a number of discussions about the MOOC observatory and its potential links to the Web Observatory over recent months and I thought I would post a quick informal summary of thoughts.

I’ve brought this together but all the work referred to involves lots of people from the MOOC Obs. community here at Southampton, and the many people involved in our growing number of UoS FutureLearn MOOCs, not least the learners who have given so much to the communities and to the courses. Particular input to this post was from Robert Blair, Lisa Harris, Manuel León Urrutia, Tim O’Riordan and Olya Rastic-Dulborough.

The Web Observatory provides a mechanism for hosting and analysing data from the web, such as social media posts, web page update logs and so on.

The MOOC Observatory, as I see it, brings together interests in MOOC learning design, analytics, their implications for education practice, and so on. This is itself what we would see as a Web Science activity, and it is therefore clearly appropriate to link the two observatories, both conceptually and practically.

I have concentrated in this post on some thoughts on the practical implications and potential of this connection.

Data Hosting and Aggregation

The Web Observatory provides a means to host and share data captured from MOOCs. Our efforts at Southampton are concentrated on FutureLearn, but I would see a core benefit of the observatory as aggregating content from any MOOC platform that allows download of data.

There are significant legal and ethical considerations in all this work. At the time of writing the FL terms and conditions and code of ethics are available here:


For our purposes the key issues here relate to managing the data provided by FL. I have extracted and sometimes paraphrased these from the documents above – please always treat the originals as the definitive source 🙂

  • No one can harvest learner personal data or course content
  • Only use or distribute the material on the courses for the intended purposes and under the terms of the existing license (which in some but not all cases will be CC)
  • All research studies by FL and its University Partners are subject to a code of research ethics (note: these documents don’t define what University Partners are but they are assumed to be equivalent to Partner Institutions. The latter are loosely defined but there is no further granularity eg in terms of which members of a Partner Institution can exert the rights assigned to it).
  • FL and Partners can do research only on anonymised data, unless a clear rationale is provided for using real names. This data includes comments created by learners. (Note: The FL research ethics do require that access to non-anonymised data should be granted only to those people who absolutely require it. Since any data containing an original comment number, full timestamp or comment content could easily be de-anonymised, we have assumed that all access to these fields should be clearly justified. Similarly we see no issue in sharing data missing these fields. Whilst de-anonymisation could still be possible, e.g. via cluster analysis by learner, this would constitute a clear breach of research ethics, would probably require banned scraping of content, and would often in any case be easy to undertake using the FL platform’s own functionality, e.g. viewing all comments by a specific learner).
  • Not selling or giving away any data that identifies individual learners.
  • Disclosure of specific content (eg quoting a comment by a named educator) is possible providing the learner is acknowledged under the terms of the CC BY-NC-ND license associated with comments. This doesn’t apply to the assignments etc.
  • The work being undertaken in this area at the moment is covered by University of Southampton research ethics approval number 12449.
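
As a rough illustration of the field-stripping assumption in the anonymisation note above, here is a minimal Python sketch that removes the comment number, timestamp and comment text columns from a CSV export before it is shared. The column names (comment_id, timestamp, text, learner_hash, step, likes) are made up for illustration and are not taken from the FL files.

```python
import csv
import io

# Hypothetical column names; the real FL export headers may differ.
SENSITIVE_FIELDS = {"comment_id", "timestamp", "text"}


def strip_sensitive(rows, sensitive=SENSITIVE_FIELDS):
    """Return copies of the rows without the fields that make de-anonymisation easy."""
    return [{k: v for k, v in row.items() if k not in sensitive} for row in rows]


def shareable_csv(source_text):
    """Read CSV text and return CSV text with the sensitive columns removed."""
    reader = csv.DictReader(io.StringIO(source_text))
    rows = strip_sensitive(list(reader))
    kept = [f for f in reader.fieldnames if f not in SENSITIVE_FIELDS]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=kept, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Made-up sample row, purely for demonstration.
sample = (
    "comment_id,learner_hash,step,timestamp,text,likes\n"
    "101,a1b2,1.3,2015-01-05 10:15:00,Great video,4\n"
)
print(shareable_csv(sample))
```

Dropping the columns outright is the bluntest option. Coarsening the timestamp (for example to the course week) might keep more analytical value, but whether that is acceptable would need checking against the ethics documents.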

Types of data

Currently as an institutional partner we can download the following FL information:

  2. Enrollment
  3. Navigation of site (some)
  4. Peer review submissions and reviews
  5. Question answer statistics

In addition we are provided with the summarised results of the pre and post course surveys. We can also make specific requests regarding the Google Analytics data captured from the platform, such as browser or location data.

As a shorthand in the remainder of this post I will refer to the MOOC observatory (data) or MOD as the implementation of data hosting and analytics within the Web Observatory. To date we have placed FL analytics data in the MOD secured to specific users. Data are manually downloaded and uploaded.

I’m also busy preparing some dummy datasets to upload which will be publicly accessible in order to encourage development of FL data analytics by a broad, open community. Alongside these data I will include the SQL data structure and a set of queries that allow rudimentary analyses (initially generated by me in Microsoft Access because it was easy and I am an increasingly sketchy hacker…).

Others in the MOOC Obs/Web Obs teams are doing much cooler analytics that you can read about on this blog – and there is more to come in the summer once the third run of the Archaeology of Portus and other UoS summer MOOCs like Shipwrecks and Battle of Waterloo are done. The full list of Southampton MOOCs is here.

The use of MS Access was, as I say, a purely pragmatic choice – we had users able to use it, we could easily generate interactive reports, and it was easily able to cope with the data volume available at that point – approximately 500,000 comments from across c. 10 FL course runs. It’s a lot more now.

The Access database uses linked tables in order to allow easy refreshing of data. FL shares a new set of updated data files each day that a course is running, and periodically thereafter when new data are created. These data are accessed via the FL website following login by a course Educator. I have been provided with access to all data relating to all of the FL courses produced by the University of Southampton. To simplify updates I partially automated this so we could capture the latest files from each course every day.

Working with Access and CSV clearly will not scale. I guess I should have hidden this approach (!) but it’s an advantage of being an archaeologist working in a computer science domain that I can occasionally plead digital ignorance. The bottom line is that it works for now – and crucially I was able to get data to the educators on the last run of the Portus course, sometimes with only a day’s lag, which I think significantly sped up response times and targeting of effort.

In the next run it should be even better – we shall see. I hope that, subject to the ethics and legal position described above, the learners will also be able to benefit from access to much of this information. It is of course the result of their input and I am committed to their data being used every step of the way to improve the learning experience. I defy anyone to spend time as an educator on a MOOC and not to be amazed and humbled by the depth and generosity of learner activity.

Still, on a standard workstation, running some of the Access queries will rapidly become impractical. As it is I have had to cache data to generate the necessary cross-course UNIONs. The expertise of the Web Observatory and data.soton teams will therefore be invaluable going forward. So, we are currently evaluating the best route to managing the MOOC data in a way that will scale to many millions of comments and other learner interactions.
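
To make the cached cross-course UNION idea concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than Access. It assumes one table per course run with identical columns. The table and column names (portus_run1, learner_hash, likes and so on) are hypothetical, and the rows are made up for the demo.

```python
import sqlite3


def build_union_query(tables):
    """Build a UNION ALL across per-run tables, tagging each row with its run.

    Table names are interpolated directly, so they must come from a trusted list.
    """
    parts = [
        f"SELECT '{t}' AS course_run, learner_hash, likes FROM {t}" for t in tables
    ]
    return " UNION ALL ".join(parts)


def cache_cross_course(conn, tables, cache_name="all_comments"):
    """Materialise the union once so later queries read from a single table."""
    conn.execute(f"DROP TABLE IF EXISTS {cache_name}")
    conn.execute(f"CREATE TABLE {cache_name} AS {build_union_query(tables)}")
    conn.execute(f"CREATE INDEX idx_{cache_name}_run ON {cache_name}(course_run)")


def demo():
    """Load two tiny made-up runs and summarise comments and likes per run."""
    conn = sqlite3.connect(":memory:")
    for t in ("portus_run1", "portus_run2"):
        conn.execute(f"CREATE TABLE {t} (learner_hash TEXT, likes INTEGER)")
    conn.executemany("INSERT INTO portus_run1 VALUES (?, ?)", [("a1", 3), ("b2", 0)])
    conn.executemany("INSERT INTO portus_run2 VALUES (?, ?)", [("a1", 5)])
    cache_cross_course(conn, ["portus_run1", "portus_run2"])
    return conn.execute(
        "SELECT course_run, COUNT(*), SUM(likes) FROM all_comments "
        "GROUP BY course_run ORDER BY course_run"
    ).fetchall()


print(demo())
```

The cached table trades freshness for speed: it has to be rebuilt whenever a new daily file arrives, which is fine at this scale but is one of the things a properly hosted database would handle better.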

The data.soton initiative has demonstrated how CSV data can be converted to RDF but are linked data useful here? And what platform is best to host the linked data if so? Should the comments data, suited to natural language processing, be held separately from the numerical and classificatory text data? I’m sure that there is cool linked data work in this space already and hopefully by the end of the summer we can be much further down the line.

We have restricted the data in the MOD so far to that gathered by Southampton whilst clarification is sought about the legal restrictions on accepting information from other MOOCs. We also have a companion password-restricted data repository using SharePoint as the default storage for our FL data.

The Web Observatory does not provide specific repository policies such as retention, data migration and so on as a default. For this reason we are developing a set of policies specifically for the MOD.


FL have kindly shared the types of queries that they apply habitually. Indeed they are extremely supportive of all the research going on around their courses. The FutureLearn Academic Network also provides a broad range of further advice and case studies. The following summarise my own attempts to implement some of these and other analyses to enhance the efficacy of the courses we run. As a team we have also used progressive changes to courses and comparison of the resultant learner behaviours as living labs, again to improve the learning experience.

So, for example on the Archaeology of Portus course we have experimented with different levels and types of educator and facilitator engagement.

The first iteration of the course concluded with a week of overlap between the MOOC and a face to face Portus Field School taking place in Italy. The online course influenced behaviour on site, including capture of new learning materials.

The second run of the course did not overlap with a f2f course or fieldwork in Italy but week six interactive content was filmed in a studio at UoS. New material was created on the basis of the number of likes each question raised on the course received. This required the development of specific search tools.
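
The kind of search described here, finding the most-liked questions, can be sketched in a few lines of Python. This is not the tool we actually built. It assumes each comment is a record with a text and a likes count (hypothetical field names) and treats any comment containing a question mark as a question, which is obviously crude.

```python
def top_questions(comments, limit=3):
    """Return the most-liked comments that look like questions.

    A comment counts as a question if its text contains a question mark.
    """
    questions = [c for c in comments if "?" in c["text"]]
    return sorted(questions, key=lambda c: c["likes"], reverse=True)[:limit]


# Made-up comments, purely for demonstration.
sample_comments = [
    {"text": "How deep was the hexagonal basin?", "likes": 12},
    {"text": "Really enjoyed this week.", "likes": 30},
    {"text": "Were the warehouses rebuilt after Trajan?", "likes": 7},
    {"text": "Is there a site plan available?", "likes": 19},
]

for c in top_questions(sample_comments, limit=2):
    print(c["likes"], c["text"])
```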

The third run of the course, which starts next week on 15 June 2015, has been scheduled deliberately so that three of its six weeks overlap with the Portus Field School, but these do not include the final week. We will not use week six to address queries but instead target feedback at timetabled points throughout the course, and attempt to make much more use of automated and learner-generated kinds of feedback.

Thankfully some of our learners have taken part in both previous iterations of the course and will be returning for a third time – they provide much needed additional ideas and critique.

Educators contributed a great many comments across the first run of the Portus course. Our impression (currently without much formal analysis I admit) is that the second iteration saw lower levels of educator and facilitator input, but that this was still significant and was more effectively targeted.

In the third iteration we will use our analyses of the previous runs to provide expert input and facilitation in a more targeted, economical way. To this end we are currently seeking permission from all UoS facilitators and educators to de-anonymise their comments in order that we can examine their impact on discussions and the wider learning experience.

I have also had a go with rudimentary topic analysis to see how better to structure the learning within the course. Once this can be applied directly from within the MOOC Observatory and hence across all course data we should see fascinating parallels and differences as a consequence of course content and learning design.
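
For anyone wanting to try something similar, a very rudimentary stand-in for topic analysis is simply counting the most frequent non-trivial words in a set of comments. The sketch below does only that. It is not the method used on the Portus data, and the stopword list and sample comments are invented for illustration.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real analysis would need a fuller one.
STOPWORDS = {"the", "and", "was", "were", "for", "with", "that", "this"}


def top_terms(comments, n=3):
    """Count words across comments, skipping stopwords and very short words."""
    counts = Counter()
    for text in comments:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in STOPWORDS and len(word) > 2:
                counts[word] += 1
    return counts.most_common(n)


# Made-up comments, purely for demonstration.
sample_texts = [
    "The harbour at Portus was huge",
    "Was the harbour linked to the Tiber?",
    "Trajan built the hexagonal harbour",
]
print(top_terms(sample_texts, n=1))
```

Run per course step or per week, counts like these give a first impression of what learners are talking about, before moving to anything more sophisticated.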

Some of the first goes at analysing the data are available on the Archaeology of Portus blog, for example FutureLearn social network: Portus in the UoS MOOCosphere. We will post some further updates soon.