# Research Data Onion

We’ve been thinking a lot about research data: how to manage it, how to open it, how to share it, and how to get more value from it without making too much extra work. For some time I’ve been considering two different ways to think about research data.

## The Onion Diagram

The first is this diagram, which shows the various layers of metadata which I see as surrounding a research dataset.

*[Diagram: Research Data Onion]*

Some of these layers are less obvious than others. The important thing is that each layer is created at a different time, by a different process, has a different purpose, and different people are responsible for it.

Often these layers are merged into a single database record, but it’s useful to think about them as distinct layers when looking at how to manage them.

As I’m most familiar with EPrints, I’ve included examples of where this information would be handled in that software if it was being used as a repository + catalogue for research datasets.

### 1. Research Dataset

This is the actual dataset produced as part of a research activity. It may be tiny or huge. It may or may not be available from a URL. In rare cases it may not even be digital (a handwritten log book of results). It might be the weird file format that the lasercryomagnoscopeatron produces, an Excel spreadsheet, or a hundred XML files. It may make sense to only a few people or tools.

If you are using EPrints as a dataset store+catalogue, this would be a document attached to an EPrints record.

### 2. Scientific Metadata

This will be provided by the researcher, and research communities will need to decide what goes into this metadata. Librarians can certainly advise and assist, but the buck stops with the researchers. This layer provides the research context for the dataset; it may include information about the processes used, and the type and configuration of equipment. Long term, I expect equipment manufacturers to be able to create much of this and output it with the raw dataset, similar to how modern digital cameras embed EXIF data in the JPEG images they create.

This might be as simple as a text description of anything you might need to know before working with the data, such as the assumptions made or the sample size, but I suspect we’ll see some fields start to standardise what metadata should be provided for certain types of experiment.

In a subject-specific archive this metadata may be merged with the other layers of metadata, but in an institutional repository there will be all kinds of weird and wacky datasets, so it’s important that the people running the data catalogue are not prescriptive about this metadata, although a subject-specific harvester may make some rules about what it should contain.

If you are using EPrints, this data would be stored in a supplementary document attached to the record. A few years back we added a “metadata” document format for exactly this purpose.

I would expect that, in time, subject-specific tools would harvest this data from multiple sources and provide subject-specific search and analysis tools which would be beyond the scope of the university repository, but easy to implement on a big pile of similar scientific metadata records from many institutions. For example, a chemical research metadata aggregator could add search by element (gold, lead…), which would be beyond the scope of the front end of the archive where the dataset is held.

### 3. Bibliographic Metadata

Here we get to most people’s comfort zone. This is the realm of good ol’ Dublin Core. It describes the non-scientific context of the dataset: who created it, what parts of what organisations were involved, when and where, who owns it, and what the license is.

With my “equipment data” hat on, this seems like the layer which associates the dataset with the physical bit of equipment (eg. http://id.southampton.ac.uk/equipment/E0007), the facility, the research group, the funder. Stuff like that. Things which the library and management care about, but don’t really matter to Science, unless you are evaluating how much confidence you have in the researchers.

In EPrints this is the metadata which is configurable by the site administrator and entered by the depositor or editor.

### 4. Data Catalogue Record Metadata

This is any data about the database record. Most of this will be collected automatically. It’s often in the same database table as the bibliographic data but it’s not quite the same thing.

This layer of the onion is stuff like who created the database record, when, what versions there have been. This can generally be created automatically by the repository/catalogue software.

In EPrints these are the fields which the system creates automatically.

This is generally merged with the bibliographic data layer unless you are doing some serious version control, but it is a distinct layer of metadata.

These last two layers are not really considered most of the time, but if we want things to be discoverable and verifiable it’s helpful to quantify them.

### 5. Data Catalogue Metadata

This is the layer of metadata about the data catalogue itself. Not all data catalogues actually contain the dataset; they may have got the record from another catalogue.

Anyhow, this layer tells you about what the catalogue contains, broadly, and the policies and ownership of the catalogue itself.

In EPrints this would be the repository configuration such as contact email and repository name, plus the fields which describe policy and licenses, which many people don’t ever bother to fill in. You can see this data via the OAI-PMH Identify method.
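For example, the Identify response of an EPrints repository can usually be fetched from a URL of this shape (the hostname here is illustrative):

```
http://eprints.example.ac.uk/cgi/oai2?verb=Identify
```

The XML it returns includes the repository name, admin email and (if anyone filled them in) the policy statements.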

### 6. Organisation Profile Document

This is something which nobody has given that much thought to yet, but for data.ac.uk we’ve proposed that UK universities should create a simple RDF document describing their organisation, with links to key datasets, such as a research dataset catalogue, and other datasets which may be useful to discover automatically. This allows the repository to be marked as having an official relationship to the organisation. Some more information is available from the equipment data FAQ.
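As a sketch of the idea only — no vocabulary has been agreed, and the URIs and predicates below are invented for illustration — such a profile document might look like:

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

# A hypothetical organisation profile document
<http://id.example.ac.uk/> a foaf:Organization ;
    rdfs:label "University of Examples" ;
    foaf:homepage <http://www.example.ac.uk/> ;
    # link to the research dataset catalogue (link target is illustrative)
    rdfs:seeAlso <http://data.example.ac.uk/research-datasets> .
```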

### Peeling the Onion

The last step is to make the Organisation Profile Document (layer 6) auto-discoverable, given the organisation homepage. This means you can verify that an individual dataset is actually in a record, in a repository, which is formally recognised by the organisation (as opposed to one set up by a stray 3rd year doing a project, or a service with test data etc.). Creating and curating these layers provides auto-discovery and probity in a very straightforward manner.

Posted in Research Data.

# FatFree, RedBean and FloraForm – A light and flexible web framework

In December Web Team spent some time playing with web frameworks. My previous framework experience is with Django, which I highly recommend, but that was not appropriate here because Python is not one of iSolutions’ supported web languages. As a result we spent some time researching PHP frameworks. PHP is a bit of a hodge-podge and PHP frameworks are no exception; there are a lot of different options.

• Zend – Zend framework seems to be everyone’s go-to framework. I found it quite large and actually difficult to get set up and running, meaning it was a non-starter. We do agile here; if it takes more than 5 minutes to get started then you’ve missed the boat. It is very full-featured and has a big user base, but seems to be aimed at “enterprise”, which is a word I usually replace with “over-complicated”.
• Cake – A bit easier to get started with than Zend and still fairly comprehensive. The Object Relational Mapper felt a bit backwards, but it is popular and has good community support.
• FuelPHP – I invested a fair bit of time in this. Very cool, lots of nice features, good ORM. A bit complicated, and the documentation and user community were a little new. I complained about the documentation being a bit lacking in places and they fixed it, but I still wasn’t confident enough to choose it as a solution.
• FatFree – Really pleased with this and chose it. The reasons are discussed below.

So why FatFree? Zero to writing code in less than 5 minutes. There are very few files and you can throw away the bits you don’t want to use. The documentation is good and Googling for problems gets solutions.

Now for the real reasons. The other frameworks I looked at all require you to work inside them. We have a huge web presence which has grown over 15 years rather than been designed. FatFree let me use my old PHP and pepper it with FatFree. Over time more of the code we write will be converted to FatFree, but we didn’t have to do a huge big-bang move. Being able to gradually improve our existing stuff was important. Also, FatFree is not trying to do EVERYTHING, and as a result it is built to work with code which was not really designed to integrate with FatFree. The best example is the ORM. FatFree provides a very basic Object Relational Mapper (it takes PHP objects and stores them in a database). This would be a weakness if it were harder to integrate other libraries into FatFree. Enter RedBean PHP.

RedBean is the best object relational mapper I have ever used, absolutely no question about it. When prototyping an app in FuelPHP I had to know exactly what I wanted the database to look like at the start. RedBean lets you completely change that around exactly as you see fit while you program, and just works out how to store it all in the database. There are a few nuances (aliasing had me completely stumped for about 20 minutes), but it’s easy once you’ve cracked it. The one thing missing was a slick way to take user input, but FatFree’s flexible design enabled us to use the FloraForm library Chris had written.

FloraForm lets you easily construct a form, parse input and customize validation. There is still a bit of work required to make this really reusable, but it has become part of our core toolset, so expect work in this area. The thing which made FloraForm the ideal addition to this little toolkit is that it returns all of your form inputs in a big PHP hash. From this hash a 10-line function serializes the data into RedBean objects, and a similar 10-line function de-serializes it back into the form. The result is that constructing a FloraForm interface builds your database tables and stores your data. This is a very fast and powerful combo. For simple systems you will need to do no further work, and you can prototype complicated systems very fast, allowing you to make radical design overhauls with very little effort. This is ideal for the ever-changing goal posts of real-world development with limited staff time.
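To make the serialization idea concrete, here is a hedged sketch — not the actual FloraForm or RedBean API; the function name and the underscore-joining convention are invented for illustration — of turning a nested form-input hash into the flat column => value shape an ORM can store directly:

```php
<?php
// Illustrative only: flatten a nested form-input hash into flat
// column => value pairs, joining nested keys with '_'. A real
// serializer would hand each row to the ORM rather than return it.
function flatten_form(array $hash, $prefix = '') {
    $row = array();
    foreach ($hash as $key => $value) {
        $column = ($prefix === '') ? $key : $prefix . '_' . $key;
        if (is_array($value)) {
            // Recurse into nested form sections
            $row = array_merge($row, flatten_form($value, $column));
        } else {
            $row[$column] = $value;
        }
    }
    return $row;
}
```

The de-serializer is the same walk in reverse: split column names on the separator and rebuild the nested hash for the form.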

One final note about FatFree worth mentioning is that it allowed members of the team who have not done framework development before to gently transition into frameworks. This may not sound significant, but in a busy working environment having to completely overhaul your working practices in one go is very painful and time-consuming. On day one of FatFree you can just use the router and do everything else as normal. After that maybe you will experiment with templates. Next time you build a database, have a play with RedBean. Before you know where you are, you are a full-blown framework developer, without the upheaval of having to learn to do your job from scratch.


# Twilight of the JISC

This year many JISC funded services are “sunsetting”, presumably due to the cuts.

(nb. not everything JISC does is ending, but enough to be pretty brutal)

I have benefitted hugely in my career and projects from the support of many JISC services, events and staff. Dev8D changed my professional life in a really good way.

I offer my sincerest thanks to all JISC funded staff moving on to new jobs this year.

- Christopher Gutteridge

How have JISC staff, services or events helped you?

UPDATE: OSS Watch “changing funding model” http://osswatch.jiscinvolve.org/wp/2013/02/15/a-new-future-for-oss-watch/

Posted in Uncategorized.

# Gateway to Research API Hack Days

Ash and I are at the Gateway to Research API hack days. Gateway to Research contains data since 2006 about UK research project funding and related organisations, people and publications.

They use the CERIF data model, which is a bit of a monster. The CERIF people are very nice, but have limited resources to produce the kind of documentation I’ve become accustomed to. I enjoy cursing the darkness, but eventually I feel guilty and decide to light a candle. The CERIF people kept looking sad when I berated them about documentation, and all they really had were the XML from their modelling tool (TOAD) and the XSD document which it spits out. With some Perl & DOM hacking and lots of advice from them, I’ve managed to produce a CERIF description document which I feel is more useful to code hackers who get twitchy when the only documentation is in PDF and the only introductions are in Powerpoint slides. They got me a couple of pints as thanks, which was nice.
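The real thing was done in Perl, but the core idea — walk the XSD as a plain XML document and pull out what it declares — is a short namespace-aware search in any DOM. A sketch in PHP (for consistency with the rest of this blog, using only the bundled DOM extension; the function name is made up):

```php
<?php
// List every named <xs:element> declared in an XML Schema document.
// From these names (and their types/attributes, walked the same way)
// you can generate a human-readable description document.
function xsd_element_names(DOMDocument $schema) {
    $xs = 'http://www.w3.org/2001/XMLSchema';
    $names = array();
    foreach ($schema->getElementsByTagNameNS($xs, 'element') as $element) {
        if ($element->hasAttribute('name')) {
            $names[] = $element->getAttribute('name');
        }
    }
    return $names;
}
```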

### GtR API

I’ve also been kicking around the API. The things I noticed were some minor inconsistencies with XML naming, which I’ve pointed out to them. But they are niggles. There are more pressing things, so here’s my wishlist:

• URI scheme: everything (well, almost everything) in GtR is identified by a UUID, but a proper URI scheme would be very helpful for creating linksets.
• Data dump location with ALL the data in one big file (maybe put this on bit-torrent)
• On the individual HTML pages, put <link rel='alternate'> headers and an icon to link to the XML and JSON versions of the information.
• RDF Output (well, I would say that, wouldn’t I)
• Release the code early and often. The current plan is to release code at the end of the project which means no community input to the code will be possible.

Posted in Data, Gateway to Research.

# Agile Documents for agile development

Like a lot of large IT providers, the work we do here in iSolutions is often steeped in documentation. This comes in various levels of usefulness, from “godsend” down to “written but never read” (aka a complete waste of staff time). In TIDT our processes tend to be quite documentation-light. If a document doesn’t serve a purpose for us we do not write it. Less time shuffling paper means more time writing code. However, just because we do not have a lot of paperwork does not mean we do not have a plan. We work closely with users and develop in an agile way. Because our changes are small and frequent, we need far less documentation per change.

People who do not understand the way we work don’t understand our documentation. An excellent example is a document (linked below) emailed to me by Lucy Green from comms regarding some changes to SUSSED. This documentation is a beautiful example of agile documentation. It is information-heavy, easy to understand and, because the change is relatively small, it is nice and short. Writing it down serves an important purpose because it gives us an artefact to talk around in our meeting. Because it’s highly visual there are far fewer misunderstandings of intent. Documents like this make me happy. It tells me what I need to know. After the change it will serve no purpose; the reasons for making the change will be listed in the iSolutions formal change management documentation, a much drier and less well-read affair.

Agile documentation

Posted in Uncategorized.

# Adding a custom Line Break Plugin to the TinyMCE WYSIWYG editor inside Drupal 7

This is a long title for a blog post, but it was a complicated and tricky task and I couldn’t find a complete solution, so this is a summary of how I did it. It also provides a good basis for adding other features to TinyMCE inside Drupal. First of all, the versions of the software I was working with were TinyMCE 3.5.4.1 and Drupal 7.14 (yes, we need to upgrade that!)

I spent a lot of time hacking inside the Drupal WYSIWYG plugin and inside TinyMCE itself before I discovered the clean plugin-based solution. My starting point was this simple TinyMCE newline plugin from SYNASYS MEDIA. This didn’t work for me out of the box. It came only as compressed JavaScript, so I had to figure out how to decompress it first. Once I’d done that, after lots of debugging I worked out that the reason I couldn’t get it to show up inside Drupal is that you have to make a new (minimal) Drupal module to register it properly with the WYSIWYG plugin (see below). After that I worked out that they had used ‘<br />’, which didn’t work in all circumstances, so I changed it to “<br />\n”, which nearly did what I wanted, but the cursor got screwed up if you did a newline at the end of the text, so I tried adding ed.execCommand(‘mceRepaint’,true); but that didn’t help. I kept looking at the list of mce commands and spotted “mceInsertRawHTML”, but that was worse. In the end I decided to ignore the glitch as it’s purely cosmetic.

My final version is below. I’ve kept the name “smlinebreak”; if you want your own name for the plugin, that string is the bit you’d have to tweak throughout.

(function(){
	tinymce.PluginManager.requireLangPack('smlinebreak');
	tinymce.create('tinymce.plugins.SMLineBreakPlugin', {
		init : function(ed, url) {
			// Register the command which inserts the line break
			ed.addCommand('SMLineBreak', function() {
				ed.execCommand('mceInsertContent', true, "<br />\n");
			});
			// Register the toolbar button which triggers the command
			ed.addButton('smlinebreak', {
				title : 'smlinebreak.desc',
				cmd : 'SMLineBreak',
				image : url + '/img/icon.gif'
			});
		},
		getInfo : function() {
			return {
				longname : 'Adapted version of SYNASYS MEDIA LineBreak',
				author : 'Christopher Gutteridge',
				authorurl : 'http://users.ecs.soton.ac.uk/cjg/',
				infourl : 'http://www.ecs.soton.ac.uk/',
				version : "1.0.0"
			};
		}
	});
	tinymce.PluginManager.add('smlinebreak', tinymce.plugins.SMLineBreakPlugin);
})();

which replaces the editor_plugin.js in the SMLineBreak I downloaded from http://synasys.de/index.php?id=5. The other files are trivial, just the image for the icon in img/icon.gif and a language file in langs/en.js which looks like

tinyMCE.addI18n('en.smlinebreak',{desc : 'line break'});

This plugin I placed in …/sites/all/libraries/tinymce/jscripts/tiny_mce/plugins/smlinebreak. Then I had to register it, not directly with TinyMCE, but rather with the Drupal WYSIWYG plugin, using a custom Drupal module…

## Drupal WYSIWYG Plugin

I gave my module the catchy title of “wysiwyg_linebreak”. This needs to be inserted into the filenames and function names, so I’ll put it in bold for clarity, so you can see the bit that’s the module name. This module gets placed in sites/all/modules/wysiwyg_linebreak/ and has just two files. wysiwyg_linebreak.info is just the bit to tell Drupal some basics about the module. As it’s an in-house hack I’ve not put much effort into it.

name = TinyMCE Linebreaks
description = Add Linebreaks to TinyMCE
core = 7.x
package = UOS

The last line means it gets lumped-in with all my other custom (University of Southampton) modules so they appear together in the Drupal Modules page. The module file itself is wysiwyg_linebreak.module and this is a PHP file which just tweaks a setting to add the option to the Drupal WYSIWYG module.

<?php
/* Implementation of hook_wysiwyg_plugin(). */
function wysiwyg_linebreak_wysiwyg_plugin($editor) {
  switch ($editor) {
    case 'tinymce':
      return array(
        'smlinebreak' => array(
          'internal' => TRUE,
          'buttons' => array(
            'smlinebreak' => t('SM Line Break'),
          ),
        ),
      );
  }
}
?>

… and that seemed to be enough. To enable it you first need to go into the Drupal Modules page and enable the module, then go to Administration » Configuration » Content authoring » WYSIWYG Profiles and enable the new button in the buttons/plugin section. Then if you’re very lucky it might work.

## Summary

It’s possible, even easy, to add new features to the editor inside Drupal. I’ve written this out long-form as I couldn’t find a worked example myself of how to add such a feature, and it took me enough time that I hope this may give a few shortcuts to people needing this or similar features.

Posted in Drupal, Javascript.

# Combining and republishing datasets with different licenses

We’ll soon be launching data.ac.uk! Right now it’s all a bit of a work in progress. The plan is for us to start with a few useful subdomains, then have other subdomains run by other organisations. Southampton neither can nor should be the sole proprietor.

The goal of the domain is to provide a permanent home for URIs, datasets and services. The problem with the .ac.uk level scheme is that sites are named either after an organisation or after a project. But a good service should outlive the project which creates it, and if you’re trying to create a linked data resource for the ages then using http://www.myuni.ac.uk/~chris/project/2008/schema/ as your namespace is a ticking timebomb of breakiness.

There are several different projects to create sub-sites right now. These are all focused on “infrastructure” rather than “research” data, but that should not be seen as a firm precedent. That said, UK-level services for research data are artificial: it shouldn’t matter where good data comes from, but from a practical point of view the UK is a funder of research, so there may be times when national aggregation and services are created.

For projects like Gateway to Research to create good linked data they’ll need good URIs. Obviously some of their data structures are going to be complex and specialised, but we want solid URIs for institutions, funding bodies, projects, researchers, publications, patents etc.

### hub.data.ac.uk

OK, this is the bit this post was supposed to actually be about.

One of the sub-domains which already exists is http://hub.data.ac.uk/, which is intended as a hub for UK academia open-data services. It has a hand-maintained list of the current open data services and their contacts. We also set it up so that it would periodically resolve the self-assigned URI for each university, and combine the triples it found there into a big document which you could query in one go.

The first problem we encountered was that Oxford and Southampton have chosen to make their “self-assigned” URIs resolve to short RDF documents describing the organisation [Oxford] [Southampton]. However, the Open University made a different assumption about what should happen when you resolve their URI. Their service generates a document describing every triple referencing their university. This isn’t wrong; it’s just large, and answers a different question.

To address this we’ve hit on the idea of asking each open data service to produce a “Profile Document”, which may be what their self-assigned URI redirects to, but will also be auto-discoverable from their main website. This we can (more) safely download, knowing more or less what to expect, and we can provide standard ways to describe elements which may be useful to list on hub.data.ac.uk.
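Auto-discovery could work much like RSS feed discovery does today: a <link> element in the organisation homepage pointing at the Profile Document. A sketch only — no rel value or URL convention has been agreed, and everything below is illustrative:

```html
<!-- In the <head> of the organisation homepage (URLs illustrative) -->
<link rel="alternate" type="application/rdf+xml"
      title="Organisation Profile Document"
      href="http://data.example.ac.uk/profile.rdf" />
```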

### Combining Datasets

The problem I’m facing this week is how to handle combining datasets with multiple licenses.

Right now I’m thinking:

For every source dataset, include a “provenance event” describing where it was downloaded from, and the license on the document that was used as the source.

nb. this is not proper RDF, I’m just explaining my thoughts:

<#event27> a ProvenanceEvent ;
    source <http://www.example.ac.uk/profile.rdf> ;
    result <#source27> .

<http://www.example.ac.uk/profile.rdf>
    attribution "University of Examples" .

<#event28> a ProvenanceEvent ;
    source <#source20>, <#source21>, <#source22>, <#source27> ;
    action <merge> ;
    result <> .

OK. So the above is true but I’m not sure how useful it is. If I’m using a dataset, all I really want to know is:

• Can I use it for the purpose I have in mind?
• What restrictions does it place on me?
• What obligations (attribution) does it place on me?

So far as I can see, combining datasets with different licenses results in a dataset which is licensed under all of them at the same time. This isn’t the same as when software is “dual licensed” and you can pick which license; this dataset is simultaneously under several licenses (like wiring them in series, rather than in parallel). Even a “must attribute” license gets out of hand with data from 180 sources (BSD was modified for a reason!)

The licenses we’re planning to accept (or at least recommend) are, in order of increasing restrictions: CC0, ODCA and OGL.

1. CC0 data only: under a CC0 license
2. CC0 and ODCA data: only under an ODCA license (with a long attribution list)
3. CC0, ODCA & OGL data: under the OGL (with a longer attribution list)
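The “most restrictive license wins” rule above is simple enough to sketch in code. This is a toy illustration only — the identifiers and the ordering are the ones assumed in this post, and none of this is legal advice:

```php
<?php
// Toy sketch: the combined dataset takes the most restrictive
// license present among its sources (assumed order: CC0 < ODCA < OGL).
function combined_license(array $source_licenses) {
    $restrictiveness = array('CC0' => 0, 'ODCA' => 1, 'OGL' => 2);
    $result = 'CC0';
    foreach ($source_licenses as $license) {
        if ($restrictiveness[$license] > $restrictiveness[$result]) {
            $result = $license;
        }
    }
    return $result;
}
```

So a merge of CC0 and ODCA sources comes out as ODCA, and adding any OGL source pushes the whole combined dataset to OGL.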

I’m not a lawyer, but this seems to go with the intent of the original publishers’ licences.

There’s also the issue of the ODCA phrase “keep intact any notices on the original database”, which would be easy to do if combining datasets by hand, but is going to be very difficult to automate. What if their notice turned out to be in the XML comments in an RDF/XML file?

I came quite late to the Semantic Web, so I suspect many of these issues were discussed a decade ago, so any tips or leads from the community would be most welcome.

All in all, my favorite license remains “please attribute” rather than “must attribute”. It’s legally the same as CC0, and makes no additional requirements for reuse, but just asks nicely if you could credit the source when and if convenient.

Posted in Data, RDF.

# How to mirror a TWIKI

We ran a few TWikis back in the day and they were pretty good, but now we tend to prefer MediaWiki. We wanted to retire some of our old TWikis because they were putting a lot of load on our webserver. Some of the code isn’t very efficient in the version we were running, but rather than upgrading we decided to close them and make a static mirror using wget. If you’ve never heard of a static mirror, or never known how to make one, I have always referred to: http://fosswire.com/post/2008/04/create-a-mirror-of-a-website-with-wget/

I searched pretty hard for how to do this best and couldn’t find any useful information. TWiki gets into an infinite loop if you try to spider it, so I had to find the combination of arguments to wget which wouldn’t get trapped in a loop but would still give me all the important content of the site.

wget -mk -w 1 --exclude-directories=bin/view/TWiki,bin/edit,bin/search,bin/rdiff,bin/oops <site_url>

Posted in Uncategorized.

# Disappointed by THE Awards

So I’m actually quite excited to be going to the Times Higher Education Awards, as Southampton has been short-listed for Outstanding ICT Initiative for http://data.southampton.ac.uk/. When (OK, if) we win, it’ll give us some great bragging rights. I’ve met one of the other short-listed ICT teams, as we’re working with them doing cool stuff with equipment data, so I won’t be too grumpy if they win, as they’ve done some neat stuff too.

The problem is: what use are these awards? Check out the “Previous Winners” page from last year – it’s bloody useless. It doesn’t even tell you the names of the projects. This entirely fails to promote good practice in the sector, and it would be so easy to link to the winning (and short-listed) teams’ entries, or better still to their project sites, so we could check them out for ourselves. I want to see what other great things are going on in UK ICT, and they are failing to take this simple step.

These awards are like the Oscars announcing only that a “Paramount” movie won the award for best supporting actor, but not bothering to tell anybody who the actor was or what the movie is called. That’s a bit lame.

Win or lose, it’s a missed opportunity for us and the other projects involved.

I’ve got to rent a tuxedo for the first time in my life so that’ll be… novel.

*** UPDATE ***

I’ve heard back from them, and they were (a) good natured about my bloggy-banter and (b) seemed to be willing to consider the issue. I don’t think they are going to change the policy, which is a pity, but if they start to hear this from more angles then maybe in time they’ll work out how they can do it.

Posted in Uncategorized.

# Merging WordPress Multisites

ECS had a blog server for some years, home to a number of mature blogs. As part of the university-wide systems centralisation, these blogs had to be migrated to the existing Southampton WordPress server. Patrick and I were tasked with this.

Our initial googling returned very little information about this, other than people saying how hard it was, so we decided that it was well worth documenting. It wasn’t as hard as all that, though we did do things that can’t be considered good computer science.

This is presented as a set of instructions, and we’re assuming that there are two multisite installations that need to be moved onto a single new server. It relies on the database structure that WordPress 3.4.1 uses, so if you have a different version, your mileage may vary.

Posted in Wordpress.