

Why I don’t quite trust ORCID

ORCID is an initiative to provide a unique ID for every person in the world publishing research. Great, it’s been a long time coming.

It uses entirely non-semantic IDs, with no indication of name, age, nationality or host organisation, making them smoothly transferable throughout a career. This is the right way to do it.

But there are a few things about ORCID I’m worried about.

You can’t use ORCID without allowing Google Analytics to watch you

The way they’ve designed their website, they use Google-hosted services. This means you can’t log in properly or use other features if you choose to block Google Analytics, which I do. I don’t see any reason Google, Facebook, Twitter etc. should get to know what I’m looking at when I’m not on their sites. I’ve used Ghostery to block web trackers ever since ECS graduate Mischa Tuffield pointed out that Facebook could track which pages you visited on the NHS website.

ORCID are aware of the issue, but have not yet addressed it, other than to suggest allowing Google Analytics on their site. My resentment starts to rise at that.

The reason this is such a big concern is that there’s talk of making ORCID mandatory when publishing some research and applying for grants. If ORCID wants to be a neutral core of the research world it should not force me to disable my privacy settings.

It sounds like they would change this given more developer cycles, but for me it’s a dealbreaker. It might seem petty, but the right to privacy is important, especially when Google makes it so easy to just give in and let them have everything.

No state-funded organisation should mandate ORCID until they fix this. It should not be too difficult to fix compared with the next, more challenging concern:

It’s really really not politics-proof

Right now ORCID sits on servers in Texas. It’s open source, which is good, but that doesn’t address my concern that it’s not distributed. Only one authority can assign ORCIDs, and it is subject to US law. If there’s a trade embargo, does that mean that Iranian researchers can’t get an ORCID? How about ones in North Korea?

More importantly, what happens if there’s a trade war between the EU and USA and they decide to cut off access for EU citizens to US-provided internet services?

I would be much happier if there was a way to fork the database & code, and for any authority to set itself up to assign ORCIDs, or at least have an international committee which could award the right to grant IDs to authorities, independent of any one power bloc.

Distributed systems are more future-proof, and we want the international research community to be pretty darn robust, right?

Number of the Beast?

A final concern for ORCID is that I can see a religious argument for not using it:

“And he causeth all, both small and great, rich and poor, free and postgraduate, to receive a mark in their right hand, or in their foreheads: And that no man might get funded or publish, save he that had the mark, or the name of the beast, or the number of his name.”  – Revelations 13

I’m only half joking. It’s a bit big brother, and it leads me to my final concern:

How does this make science & research better?

The benefits of ORCID seem to accrue mostly to people monitoring and managing research, and to researchers promoting themselves for better jobs. We’re building blocks of fundamental infrastructure for the future of the worldwide scientific community, so at every step we should be asking, does this make research & science:

  • cheaper for the same quality?
  • quicker for the same quality?
  • higher quality results for the same cost?
  • able to do things we just couldn’t do before?

ORCID doesn’t really do any of those things. It promotes people, not research outputs, and that’s fine, but it’s not going to save the world.

What we really should be focussing on is how we use the Internet to do better and new science, faster and cheaper, and to communicate it effectively.

I would far rather see this much energy go into Scholarly HTML, but that doesn’t help count the beans, so we’re stuck with the tyranny of PDF for another decade. Yes, I know some services such as PubMed are now doing online publishing in a sane way, but EPrints.soton.ac.uk had 54 HTML documents for the year 2013 and 2179 PDF documents, so we’re still simulating sheets of paper like a bunch of chumps.

Sigh.

On the plus side, ORCID is basically a good thing, and I hope it succeeds which is why I think it’s important to hold it to a much higher standard than a national service, or a non-essential service.

Posted in ORCID, Research Data, Terms and Conditions.


Digital Democracy

The Speaker’s Commission has made a call for evidence on:

  • The role of technology in helping Parliament and other agencies to scrutinise the work of government
  • The role of technology in helping citizens to scrutinise the Government and the work of Parliament
  • The nature and format of information and data about Parliament and government that is published online

The nifty thing is they’ll accept blog posts, so here goes. Apologies for the showing-off bio at the start, but I want to state my credentials.

TL;DR (executive summary)

Data on parliament should be published in a model that could apply to any democratic organisation, not a custom vocabulary. Make all information available via a website as well as machine-readable data. Government should design its processes so that data available to citizens is the heart of internal processes of parliament and the civil service, not an afterthought. All entities involved in democracy (people, places, events etc.) should have an official URI.

About me

My name is Christopher Gutteridge and I work for the Innovation & Development team at the University of Southampton; we are the techies who try to implement new technologies. I work closely with the Web & Internet Science research group headed by Professor Sir Nigel Shadbolt. I launched the University open data service, for which we received the Times Higher Education award for outstanding ICT initiative of the year 2012. More recently I have launched data.ac.uk, which aims to provide a hub for open data in UK academia and services that get more value from open data. Our most notable success has been with equipment data, where we combine open data on research equipment and facilities to provide a combined dataset and search, with the aim of improving collaboration and reuse and avoiding unnecessary purchase of capital items. At the time of writing we are collecting open data from 25 universities and research institutes.

I’m focussing on how we would use digital systems to describe the structure and process of government, and not on the publishing of the many sets of statistics and other documents that government uses to make decisions.

A standard non-UK-specific vocabulary

First of all, I think it’s important to recognise that while a few aspects of UK government are unique, it has lots of structure in common with any other government. When describing elements of our government as open data, we should strive to do so in a way that allows one tool to understand data from different sources. For example, a committee in the UK Government can be described in the same way as a committee in a local council, a university student union, or a charity committee in Kenya. By using generic, reusable terms to describe each entity, we encourage the development of software that understands this data, both as a consumer and producer.

This will create an ecosystem of technology that can work with the data, and software developed to work over the data about committees in one organisation can be repurposed for another.

The counter case is that if every government in the world developed its own bespoke way to describe its organisation and operations, then each one would require custom software development and many of the benefits of open data would be lost. It would be like every country creating a different gauge of railway.

If the government is serious about “digitising democracy” then the language of this should, where possible, use existing standards, and where they don’t exist, support the creation of international standards for communicating the process of governance. I know that The Open Data Institute (ODI) has some of the brightest and best in this area.

Once established standards exist, it is likely that low-cost tools can and will be created to generate data in formats compatible with those used by parliament. It makes total sense to move towards mandating that all government processes are recorded as semantically structured data. For example, instead of recording the minutes of a council meeting as Word documents, they should be recorded in a database which stores each item and action and logs attendees and apologies. It’s important that such a system should be very easy to learn to use.

Open Data users and the other 99.99% of citizens

The problem with publishing things as data is that it is inaccessible to the majority of citizens without the mediation of a software tool. It is likely that a number of tools will be created by third parties, but it should be a principle that all information is available to anybody with a web browser and basic (web & English*) literacy.

[*possibly “appropriate language literacy”, I don’t know if some records may be in Welsh]

This doesn’t have to be a complex tool with fancy JavaScript graphing and so forth, but anybody should be able to set out to find a fact through the parliament website and, if it exists, they should find it.

Eat your own dog food!

Wikipedia explanation of “eating your own dog food”.

This is an important principle. The information used by the citizens should, as much as possible, be the same information used by the civil service, councils and parliament. Rather than a watered-down view of the “real data”, the citizens should have access to the same basic information as that used to run the government. Obviously there will be parts that are privileged or confidential, but the heart of the system should be the information available to all. Actually, the core description of our government and its processes should not be something we’re granted; it should be something we, the citizens, collectively own.

The easy solution would be to make openness and transparency a “bolt-on feature” of the normal business of government, but that’s likely to be cut when money is tight or staff are overworked. I want to see something more radical, which is to say that the core of government should be assumed to be open and available to all, with some confidential processes on top of that. Obviously there will be some parts of government structure and meetings which are confidential for reasons of security, but these should be the exception, not the rule.

A principle here is that the majority of information communicated between departments should be open. Obviously exceptions exist for privacy and other good ethical or legal reasons, but generally I would like to see civil servants using the same data sources as the public. A good example here is how useful http://data.police.uk/ has been to the police’s own staff.

I’m aware that this would take many years to achieve even in part, but the principle is worth striving for as it will get better information for both citizens and government.

Authoritative Identifiers make everything better

One of the most important things the government should be doing is ensuring that the elements of government have authoritative and persistent identifiers. Schemes for identifying things should be independent of the department issuing the identifier; departments of government change, but that should not impact the ID schemes. Many things in the UK already have a clear ID, such as charities, companies and car license plates, and that’s good. In addition I would like to see a formal identification code for all the key parts of the democratic process:

Places: Regions, Council Wards, Parliamentary Constituencies

Events: Elections, Committee meetings, committee membership changes, debates, decisions, change of legislation

Legislation: Acts, sections, bills. http://www.legislation.gov.uk/ is doing pretty well on this front.

People: MPs and other elected officials should all be assigned a unique government ID which should persist in identifying them in all public roles. However, civil servants have more of a right to privacy, and under some circumstances it may be better to identify them by their role rather than as the individual person.

A role: MP for Southampton Test, Chair of the Badger Goalpost committee, Minister for Cheese and Wine. These represent the position, not the person(s) filling the role. A role may be empty at some times. A role may have the dates it was created and dissolved.

Person filling a Role: A relationship to an organisation or a responsibility, such as a membership (or chair) of a committee. It’s useful to assign a unique ID to the actual relationship between a person and role as it may have metadata such as a start date and end date, and whom they succeeded and by whom they were succeeded.
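As a purely illustrative sketch (the URIs and field names here are invented, not a proposal), the sort of record that justifies giving the relationship its own identifier might look like this:

// Illustrative only: the relationship between a person and a role gets its
// own URI, so dates and succession can be described about the relationship
// itself, independently of the person or the role.
$role_holding = array(
  "id"           => "http://example.gov.uk/id/role-holding/12345",  // invented URI
  "person"       => "http://example.gov.uk/id/person/67890",        // invented URI
  "role"         => "http://example.gov.uk/id/role/badger-goalpost-committee-chair",
  "start_date"   => "2013-06-01",
  "end_date"     => null,  // still in post
  "succeeded"    => "http://example.gov.uk/id/role-holding/12001",  // whom they succeeded
  "succeeded_by" => null,
);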

The reason it’s so valuable to create all these IDs is that the process of transparency doesn’t stop at the boundaries of government. It gives us, the citizens, the ability to unambiguously review and annotate the business of government. It means that diverse groups can provide tools that augment the process. By using the authoritative IDs for each element of the democratic process, these systems will be naturally compatible with each other and with those of the civil service; once again the analogy of using the same gauge of railway is apt.

Oh, and those unique IDs should be URIs, of course. Although there’s the question of what happens to data.gov.uk URIs if Scotland goes its own way.

 

Thanks for the opportunity to comment on this. I’m excited to see what happens next.

I’d encourage friends and colleagues to chip in with comments if I’ve missed something important.

Posted in Best Practice.


Jisc: Regeneration or Rebranding?

In the last year JISC seemed to disappear. They have endured an internal restructuring and big cuts (in the name of austerity, I assume), and several of my friends lost their jobs when UKOLN was closed down. The good news is that the people at UKOLN seem to have all gone on to good, interesting jobs in other parts of the sector. To find out where some of them landed, see the UKOLN-diaspora.

At Southampton we have historically had more than “our fair share” of JISC money. Part of the reason for that, I think, was that we took on projects that were realistic, that we ran them with the goal of producing something useful rather than just ticking the boxes to fulfil the requirements, and that we staffed them with people who were primarily developers rather than researchers. In the last year there’s been very little funding for projects like this, not just for us, but for the UK. This means that some very talented people have had to find jobs outside the sector. Rebuilding this skillset will take us many years even if we start now.

First they cancelled Firefly, then they cancelled Dev8D, what next? Bacon?

mmm, hyperbolicious.

Some of these impacts were not obvious in advance, and a few things occur to me in retrospect.

  • Small JISC projects were like startups; most fail but a few can pay off big.
  • Small JISC projects let people fail, and failing is how you learn not to fail next time. Many .ac.uk developers had the opportunity to learn their craft without being under the yoke of university central IT, which by its nature is risk-averse.
  • Small JISC projects were sometimes exploited, I hear, as a way to keep paying a promising young researcher in between the big projects. While this is an “exploit”, it may have prevented losing good young researchers who can’t afford not to be paid for 6 months between two big projects, so it may actually have benefited the sector.

Who moved my Cheese?

Those days are over. In the big review of JISC, the mood was very much against these little projects. The days of things like RedFeather getting funded are past, and we need to get over that and work out how to operate in the new world order, where Jisc (née JISC) will, I understand, be looking at bigger projects and providing services to the sector.

What we need to do is make sure there is still a career path from “comp-sci graduate” to “experienced innovator”. There are a couple of ways we can do this.

Idea 1: Accept that innovation is part of the costs of running a university

First of all, universities need to realise that they need to grow this talent in-house. You can’t afford to hire people with the required talent and passion unless they have near-zero experience or, by sheer luck, happen to have to relocate to your city. This is, I believe, also a very good way to get more female developers. If you can give women engineers a chance to love the subject and do cool things then they’ll love the subject and do cool things. As will the male ones. It may be that with the new approach to Jisc, university management need to wake up and realise that hiring, nurturing and retaining this talent is now something that has to be done with their own money, not JISC grants. This is going to be a difficult change, because the obvious home for this is the university IT department, but the primary goal of the IT department is to provide reliable services at an affordable cost. Innovation is disruptive, and many IT managers will be spending time ensuring it doesn’t happen. I had a wonderful meeting once where an IT manager at Southampton asked me what best practice was for university open data sites. His usual (and very reasonable) approach would be to find out what the leader(s) in the field are doing and learn from them. It was a culture shock that we *were* a leader, so were kinda making it up as we went along and learning from our mistakes.

At Southampton, I am part of the “Technical Innovation & Development” team. Not every university has one of those, and sometimes people want to sort us into our appropriate “boxes” in the normal IT structure, Networking, Databases etc, but so far that’s not happened and we’ve got enough wins under our belt that us bimbling along in a chaotic-good kinda way is now easier to justify.

One thing I don’t like, but don’t know how to change, is that we’ve started hiring people for fixed-term projects. When the money runs out we may lose those staff and the insight they’ve gained into the organisation. It would be nicer to hire people and then have them work on a project, but as part of the larger team, and when that project ends they rejoin the pool rather than just leave. It would also mean they were more likely to help with other team members’ projects and vice versa. This way it’s not their time, specifically, on the project, but the team as a whole with them being the one focused on it. This is not easy when the budgets for permanent staff salaries are being squeezed, so I’m not sure how realistic my idea is, but it would be nicer and better.

Idea 2: Split too-big-to-fail projects into sub-projects

If we’ve taken, say, a bajillion-pound grant to design and deploy a national video conferencing network, then there’s no way we can allow this project to fail. It would be a reputation disaster and harmful if it didn’t happen. However, we can look at this with more agile eyes: work out what the essential parts of the project are and make sure those are more than fully resourced, then take lots of the other features and give them to less experienced developers, who can start gaining that vital experience. If their part failed entirely, the main project would still deliver, just with one less bullet point on the list of nifty features.

This idea is something the new Jisc can help with. I would like them to accept that you can’t know up front exactly how a really big project will go, and sometimes as you go you discover opportunities which would not have been obvious at the start. The advantage of an in-house project is that you can alter course; with a funded project you often have to stick to the course you agreed because that’s what you were paid to do. That doesn’t get the best results. I’ve seen a few projects in the past that failed by succeeding. That is, they did exactly what they set out to do and ticked all the funder’s boxes, but learned along the way that they should have done it a bit differently, and didn’t feel they could because that’s not exactly what they had been funded to do.

So what I want to see from Jisc, on these big new initiatives, is:

  1. Don’t write a detailed specification and make the project stick to it, allow them to negotiate adjustments as they learn the best solution.
  2. Get them to deliver early and often. Often the vital feedback on a project only comes after it’s been in use, and all the funding is already spent. Also, anything that’s open source should be open source from day one, not released at the end of the project.
  3. Allow the big project to have experimental aspects which may fail. Do not force the entire project to be risk-averse when it isn’t part of the core functionality of the project.

Failing fast saves money. If an idea doesn’t pan out, it’s better to ditch it and try something else rather than spend way too much time and effort on that one requirement. Often the project will be better served by cutting the losses and using the money & time saved to do two other nifty features that can actually deliver.

Cold dead eyes?

I stole this image from Rachael Berry who is an amazing artist and you should buy her work.

One of the things I’ve heard much muttering about is that Jisc doesn’t care any more. For the past two days I’ve been at Jisc Digifest, which is sort of the relaunch party after a year of internal changes.  If you believed some of the stuff I’ve heard over the last year, then the new Jisc is a soulless beast that cares nothing for the developers and will chew us up and spit us out.

Nothing could be further from the truth. The biggest challenge I had in talking to Jisc staff about the difficulties of the past year is that I kept making them cry (really, and I’m so very sorry). They are very aware of how their being off the grid has disrupted our communities, and they care deeply about how we can make things good under the new guidelines from on high.

What we shouldn’t do is just “wait and see what they do”. What we should do is engage with them early and often to understand their new objectives, and how to best design the new system to get all the secondary benefits we can, while achieving the new goals that resulted from the review.

So please remember, whatever you think of the new policies, that Jisc staff are not your enemy. If you are angry at Jisc, please direct it at their upper management and the people who set their targets. It’s amazing that the staff have not become hardened and cynical this past year; please show them some love, they need it right now.

Looking fundable

Right now I reckon that they are going to be a bit risk-averse. They can’t afford to have any high-profile turkeys for a while. With that in mind I made sure that I had a raft of nifty work already started at data.ac.uk which they could fund-to-improve rather than fund-to-start. Equipment.data is funded by EPSRC; the rest of data.ac.uk is done partly in my own time, partly at work as a 20%-style project.

I’ve been a bit shameless in pointing out to Jisc staff all the niftiness we have that could be funded to fan the sparks into a fully realised and supported services or products. I would have never thought like this in the past, but right now I share my working space with two team-members both on fixed-term contracts so, for the first time in my life, this funding-thing is very real. They are my friends and have mortgages to pay. That said, by the time they get to the end of their current contracts they’ll have a buttload of experience in open linked data for organisations and right now that’s a very rare skill that’s going to be increasingly in demand.

I can only really do this due to my position as a permanent member of innovation staff at the university. The people on fixed contracts need to (mostly) focus on what they are funded to do. I think this is going to be a sensible model for trying to gain Jisc funding: invest some time into building a proof-of-concept, or even starting a basic service as we have done, the first few iterations of an agile project. This massively reduces the risk of the money being wasted, and means that the university accepts the cost of a bit of innovation in the hope that some of it will gain funding later. It does require a good understanding of what objectives new-Jisc has for their money, and that is something I don’t fully grok just yet; I need to see if the Digifest keynotes can give me a bit more insight. I missed them because I was manning our booth.

Rebranding

One thing I wasn’t a big fan of was the feeling of “spin” at Digifest. The festival theme was a nice idea, but didn’t quite work when everything is still a little tense. Jisc are trying hard to present an image, and I’m not really buying it just yet. I want to see what they do rather than the image they try to put out.

I didn’t have a chance to see many of the talks at Digifest, but had many very interesting conversations with interesting people. Some gave me ideas, others I gave ideas to. I only caught two talks, those from Joss Winn & Paul Walk (both Dev8D regulars). Both talks were about trying to bridge the gap between the developer and hacker mindset and university policy and management. Good stuff. Especially if I want a hope of one day rising above “Senior technical specialist” without just giving up vi for good and becoming a people-manager.

Regeneration

Change is difficult, but I’m choosing to be optimistic until proved wrong.

So right now Jisc as an organisation have a lot to prove and some trust to re-win. I’m choosing to give them the benefit of the doubt and accept things are going to be different. I will have to let go of some of the things I liked about JISC in the past, but I’m pretty sure there will be some exciting things which they will be able to do now that they couldn’t before.

The Dev8D community is still out there, we still talk and support each other and would love the chance to run events again to grow and strengthen the community and most importantly to meet the next generation of young developers.

A final bit of good news is that the Institutional Web Management Workshop (IWMW) is un-cancelled! It’s going to be run by CETIS instead of Jisc, but it’s great that it’s survived the kerfuffle.

 

Posted in Events, Jisc.


Dialects, Jargon and RDF

There’s a problem I encountered some time ago, and then more or less forgot about, but other people are having similar challenges so I thought I’d try to articulate it.

A bit of background about RDF literals

(if you know RDF well you can just skip this section)

The RDF way of structuring data allows you to say several things about a string. The simplest version says nothing; it’s just a list of characters:

"Hello"

Then you can assign one of the common XML style datatypes:

"Hello"^^xsd:string .
"23"^^xsd:positiveInteger .
"1969-05-23"^^xsd:date .

The bit after the ^^ can actually be any URI, so you can have

"2342A-1.3"^^
    <http://example.org/vocab/vendtechtron-product-serial-number> .

(nb. a lot of things which get called a “number” are really just identifiers: strings of characters).

The final variation is a bit weird. You can indicate that a string is text in a given language. eg.

"Hello!"@en .
"Bonjour!"@fr .

And also specific variations of languages, such as

"Hi, parner!"@en-us .
"Wotchamate!"@en-gb .

You are not allowed to set both a language and a datatype on a single literal, so:

"XYZ" or "XYZ"@en or "XYZ"^^<http://foo.com/bar> are all legal but "XYZ"@en^^xsd:string is not.

I’ve never really understood why the designers didn’t use defined datatypes for languages, eg.

"Hello"^^<http://w3c.org/ns/lang/en> .

I’m sure that most RDF systems probably optimise datatype & lang into a single variable internally.

Other dialects

The problem with this very simple attitude to language is that it misses how subdivided dialects can become. For example:

University X has a thing they do which we’ll describe as “a unit of education for which a student may enroll, for a fee, and may receive an award”. They call it a “course”.

University Y doesn’t have courses, it has “presentations”, however semantically it’s the same thing.

We can easily define a URI for this class, say <http://example.com/vocab/EnrollableLearningUnit> but I want a way to describe the label appropriate for university X users and university Y users.

Option 0: Ignore the problem or enforce a national standard

Included but not really an option because THIS IS NOT THE WEBBY WAY! The web works because it can cope with the fact different systems don’t all work exactly the same way, but can still link up.

Option 1: Separate label datasets

I could provide each university with a local terms file to include, but that’s a bit of a disaster as they can’t safely merge their data.

eg. University Y gets a dataset with data like:

<http://example.com/vocab/EnrollableLearningUnit> rdfs:label "Presentation" .

Option 2: Invent datatypes for these dialects

<http://example.com/vocab/EnrollableLearningUnit> rdfs:label 
   "Course"^^<http://id.uni-x.ac.uk/vocab/our-way-of-describing-stuff>.

<http://example.com/vocab/EnrollableLearningUnit> rdfs:label 
 "Presentation"^^<http://id.y.ac.uk/vocab/term-in-our-dialect>.

I guess this isn’t too bad, but it’s not very intuitive.

Option 3: Invent our own language codes

<http://example.com/vocab/EnrollableLearningUnit> rdfs:label 
    "Course"@en-uni-x .

<http://example.com/vocab/EnrollableLearningUnit> rdfs:label 
    "Presentation"@en-uni-y .

This is going to break things somewhere. I wouldn’t recommend it.

Option 4: Model it in RDF

We could actually assign a URI (or blank node) to the concept of the label and then use the RDF structure to explain the difference.

<http://example.com/vocab/EnrollableLearningUnit> dialect:label 
    <http://example.com/vocab/EnrollableLearningUnit#label> .
<http://example.com/vocab/EnrollableLearningUnit#label> a 
    dialect:DialectSpecificText .

<http://example.com/vocab/EnrollableLearningUnit#label> 
    dialect:text "Presentation"@en .
<http://example.com/vocab/EnrollableLearningUnit#label> 
    dialect:inDialect <http://id.y.ac.uk/vocab/our-dialect> .

This is sorta elegant until anybody tries to actually use your data.

Option 5: Use a predicate for each dialect

<http://example.com/vocab/EnrollableLearningUnit> dialects:labelForX
   "Course" .
<http://example.com/vocab/EnrollableLearningUnit> dialects:labelForY
 "Presentation"

This would certainly work, but it’s ugly and would make consuming the data fiddly.
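For what it’s worth, here’s a minimal sketch of what consuming Option 5 might look like (the dialects: prefix is expanded to made-up URIs, and the fallback order is just an assumption); the point is that every consumer has to know about every dialect:

// Sketch only: pick a label for a class, trying dialect-specific predicates
// first and falling back to a plain rdfs:label.
$labels = array(
  "http://example.com/vocab/labelForX" => "Course",
  "http://example.com/vocab/labelForY" => "Presentation",
  "http://www.w3.org/2000/01/rdf-schema#label" => "Enrollable Learning Unit",
);

function label_for( $labels, $dialect_predicates )
{
  foreach( $dialect_predicates as $predicate )
  {
    if( isset( $labels[$predicate] ) ) { return $labels[$predicate]; }
  }
  // no dialect-specific label, so fall back to the generic one
  return $labels["http://www.w3.org/2000/01/rdf-schema#label"];
}

// a University Y system has to know which predicate is "its" dialect:
print label_for( $labels, array( "http://example.com/vocab/labelForY" ) );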

Which option?

I have no clue. That’s why I’m writing this blog post. Labels (and descriptions) aimed at different audiences are not something I’ve yet seen done nicely in RDF.

This problem isn’t going to go away any time soon. At Southampton, what our students call “a degree” or “course” (eg. “3 Year BSc Computer Science”), the student admin are more likely to call a “programme theme”, and the underlying database is US-made so calls it “MAJOR”.

As a community we need to solve this at some point as there really is a good reason for audience specific labels and descriptions beyond simple national language variations.

Posted in RDF.


Location Independent Software

When I started working for the university, everything I developed was designed to work on a single, well-known setup. I didn’t have production, pre-production and testing setups. I just hacked away and if I made a mistake I fixed it quick. Some of the ECS Intranet still runs like that, and it’s not a terrible system for a small team with a userbase who care more about getting new features than they do about 100% uptime.

However, I’ve learned over time that there’s lots to be said for being able to develop and test software separately from the live setup, and more importantly to just set up a new install for someone to develop on their desktop. Many of the new services we develop are set up like this, though not as many as I’d like.

Sysprog vs Developer

I’ve been trying to train myself out of the habit of “hard-wired paths” in my code. It’s as easy to avoid as unnamed magic constants, if you clean up as you go. It’s also pretty easy to retrofit to an existing setup, if you can use grep well enough.

Basically:

BAD:  open( "/home/systemx/v2/etc/config.json" )

GOOD: open( $BASEDIR."/etc/config.json" )

I should note at this point that I generally design applications for db and web, and they almost always have a /var and /etc directory under the base path of the install, rather than scattering the install around the filesystem. I like stuff which I can “git clone” or “tar xzvf” and have it work in place; it makes it easier to commit changes back to the git repository.

Basepath

If it’s a web-based app then you can put configuration into the Apache vhost configuration to set the base path of where the application was installed, but you don’t need to. Most languages I use can work out the absolute path of the directory the script you are running is in, and from that the relative path to the base of the install. eg. if the script is /where-you-installed-the-app/bin/process then you can get that path from within the “process” script and then just add “/..” to get the base directory. If the script/library is deeper in the install path then just add a “/..” for each additional level.

For my own benefit, as much as anybody else’s, here’s a cheat sheet of my most-used languages’ ways of doing this:

Perl

use FindBin;
use lib "$FindBin::Bin/../lib/perl";
my $BASEDIR = "$FindBin::Bin/..";

PHP

Thanks to Andrew Milsted for a much more concise version than mine.

$base_dir = dirname( __DIR__ );
require_once( "$base_dir/lib/php/utils.php" );
require_once( "$base_dir/settings.php" );

BASH

This one I learned more recently (bless stack overflow and all who sail in her):

BASE_DIR=`dirname "$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"`

Others?

If you can contribute similar nuggets for other languages (Java, node.js, Ruby, Python, C, etc.) please leave them in the comments.

Secrets and local stuffs

This may sound obvious, but you don’t want to commit your passwords, and other secrets, into the git/svn repository. Especially if it’s on github.com!

There are also other things which may vary between two copies of the same tool: locations of libraries, the URL (if it’s a web thing) and maybe some other stuff. The way to handle this is to move all of it into one config file and then carefully make sure that file is never checked into the version control system. To make life easier, you can make a fake version, usually called something like config.ini.template, which has a reminder of all the config elements needed but no real passwords, and you check that into the repository so when you do a new install, testing copy or whatever, you just copy that to config.ini and edit it.
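As a sketch of the sort of loading code I mean (the file name and keys are just examples), the application can refuse to start and point the installer at the template when the real config is missing:

// Sketch: per-install settings live in config.ini, which is never committed;
// config.ini.template is committed and documents the keys without real secrets.
$base_dir = dirname( __DIR__ );
$config_file = "$base_dir/config.ini";

if( !file_exists( $config_file ) )
{
  die( "Missing $config_file - copy config.ini.template to config.ini and edit it.\n" );
}

$config = parse_ini_file( $config_file );
// e.g. $config["db_host"], $config["db_pass"], $config["base_url"]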

Hostnames

A good example of me cocking this up recently: we cloned the entire virtual machine that a service was on, so that we could have a development environment. That meant we cloned the MySQL database, right along with its username and password. All of that would have been fine if I’d used “localhost” as the hostname, but I used the absolute domain name of the machine… so for a little while the scripts on the test server were updating the live database. We caught the error and fixed it and no damage was done, but it added to the ever-growing list of “things to look out for next time”.

URLs

The URL of a website should also be configurable. We’ve had a few occasions where we did this, but the template file had absolute URLs linking to JavaScript or CSS, which caused confusion and had to be unpicked when discovered.

For extra bonus points, allow the software to be installed so that it runs on a subpath of a domain, eg. http://example.org/myapp/, rather than just assuming it’s going to get a domain name to itself; this is easy for some people but a showstopper for others. If you don’t design everything from the ground up with an adaptable URL path in mind then this can be one of the most time-consuming things to retrofit.
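One way to make that painless (a sketch, with made-up function and setting names) is to never write a link directly, but always pass it through a helper that prefixes the configured base path:

// Sketch: one configurable base path, so the same code works at
// http://example.org/myapp/ or at the root of its own domain.
$config = array( "base_path" => "/myapp" );  // set to "" if installed at the root

function app_url( $config, $path )
{
  // $path is always written relative to the app, e.g. "/style/main.css"
  return $config["base_path"] . $path;
}

print "<link rel=\"stylesheet\" href=\"" . app_url( $config, "/style/main.css" ) . "\" />";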

An advantage of just insisting your application runs at the base path of the domain, eg. http://myapp.example.org/, is that you can write the code without the domain name at all; just make all the links relative to the base of the site, like “/foo/bar.css”. There is a reason this can suck though: if you produce any kind of API which exports HTML to other sites (many of our tools at the uni package up HTML chunks to be used elsewhere), then the HTML links will be broken when shown on the other site. This most commonly manifests as missing JavaScript or images.

bin/ directory style

These are petty, but I’ve decided I have some thoughts on good practice in /bin/ directories.

The first is to use hyphens in command names, not underscores. My team tends to use them interchangeably, but a command name is a verb, not a name. Underscores are fine for passive cattle like config files. I admit this is pretty arbitrary, so the biggest no-no should be mixing the two in the same directory, something I am guilty of myself.

I also suggest not adding filename suffixes to bin (command-line) scripts. They are not needed on UNIX, Linux or OSX setups, as the system uses the #! line to determine what to run. There’s a good reason for dropping them: I’ve had people be nervous of using a command I wrote in Perl because they “don’t know Perl”. It sounds silly, but by removing the suffix you are indicating to the user of the script that they don’t need to care about how it works, just what it does.

Following on from the bit about absolute paths, don’t put absolute paths in the #! line if you can help it. There are incantations for working out the correct location of tools like bash & perl. Also, it’s really a pain to have something like “#!/usr/bin/perl -I/opt/eprints3/perl_lib”. That EPrints example was one of mine and, checking the EPrints project commit logs, it was tidied to using “FindBin” by Tim Brody in 2010 (thanks, Tim).

#!/usr/bin/env xyz

There’s a technique I’ve seen used, which is to not use

#!/usr/bin/perl

but rather

#!/usr/bin/env perl

…which seems to use the PATH to work out where to find the perl command. While this is kind of clever, I’m not convinced it’s 100% safe (what if the user had a weird copy of perl in their path?), but I’d be interested to know if people can tell me how safe it is.

Other tips?

Have I missed any obvious, or not-so-obvious, best practice or handy tips?

 

Posted in Best Practice, Command Line, Perl, PHP.



Open Data Metrics

Happy new year.

First of all, welcome to a new team member, Andrew Milsted. Andrew’s post is funded by HEFCE to work on developing equipment.data.ac.uk. He’s busy right now turning my 0.1 code into something a bit more suited for the long term, and as a side effect we hope to release some libraries which will be reusable in future projects.

So right now our little team is:

Christopher Gutteridge & Patrick McSweeney – General Web Meddling for ECS, the Faculty of Physical and Applied Sciences, and in general.
Ash Smith – Linked Open Data Specialist for http://data.southampton.ac.uk/
Andrew Milsted – Open Data Specialist for http://equipment.data.ac.uk/

We are all situated in the Level 4 labs in B32, and are part of the iSolutions “Technical Innovation & Development Team”.

A big problem with open data…

…usage metrics.

The problem of usage metrics came up with data.southampton.ac.uk, but is a far bigger concern for equipment.data.ac.uk as we need to be able to show an external funder that we’re having an impact.

We conform to (some) good Linked Open Data principles, our URIs resolve (mostly) and our data can be downloaded as a big single file; no noodling with APIs is required.

Local caches of the data dump

However, this means that 3rd parties can snapshot our entire database and use it on things like their internal Intranet. In fact we do this ourselves; our university Intranet, SUSSED, has a search box which searches a mash-up of the open data list of equipment from equipment.data.ac.uk and some additional (non-open) data we get from some strategic partners. The catch is that we’ve got no way to know what’s been searched for there, or how often, so we’re losing all that juicy business intelligence.

This is going to be a sticking point in future. Organisations may decide that their statistics gathering is more important than being fully open, so may require all searches to go via an API to discourage 3rd parties from being able to search the whole dataset. That sucks, but I can see why a manager would make that decision.

Google Analytics can’t help you here

Most non-technical web managers seem to use Google Analytics as their go-to stats system. I can see the cookies on many .ac.uk sites I visit. The problem is that it doesn’t work if you leave the HTML world. Many or most of the hits on our datafiles will be by automated scripts downloading them, or by semantic tools resolving the data for an entity, such as http://equipment.data.ac.uk/item/378b5a86ab130959dd62a68b9b7110a1.ttl (nb, .ttl files don’t open in the browser).

Those requests can only be tracked by the software running on the server. And they are, but it’s hard to know what to do with that data. It would be handy if someone could make a semantic-website-aware way to analyse an Apache log.
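In the meantime, even a crude script over the access log tells you something. Here’s a rough sketch, assuming the standard Apache combined log format and using the file extension of the request as a cheap proxy for the format being asked for; the log path and the extension list are just examples:

// Rough sketch: count requests by apparent format and by user agent, to get
// some idea of how many hits are browsers vs. scripts and semantic tools.
$counts = array();
$handle = fopen( "/var/log/apache2/access.log", "r" );
while( ( $line = fgets( $handle ) ) !== false )
{
  // combined format: ip - - [date] "GET /item/xyz.ttl HTTP/1.1" 200 1234 "referrer" "user agent"
  if( !preg_match( '/"(?:GET|HEAD) (\S+)[^"]*" (\d+) \S+ "[^"]*" "([^"]*)"/', $line, $m ) ) { continue; }
  list( , $path, $status, $agent ) = $m;
  if( $status >= 400 ) { continue; } // only count successful requests
  $format = preg_match( '/\.(ttl|rdf|nt|json|html)(\?|$)/', $path, $f ) ? $f[1] : "other";
  if( !isset( $counts[$format][$agent] ) ) { $counts[$format][$agent] = 0; }
  $counts[$format][$agent]++;
}
fclose( $handle );
print_r( $counts );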

Sidebar: Should semantic tools honour robots.txt? Are they robots or browsers?

You can’t even use cookies, because most of these requests will be made by naive robots which don’t store and return cookies between requests.

URI resolution patterns

Right now many open data sites use Tomcat (personally, I’m still on Apache 2 and very happy thank-you-very-much). We usually use a stand-alone domain to resolve 303s for URIs. I prefer this as it is more obviously a distinct service. eg. compare our open data service with DBpedia. In my opinion, ours is a better model:

  • DBPedia
    • Identity: http://dbpedia.org/resource/Southampton
      (303 redirect to)

      • HTML Page: http://dbpedia.org/page/Southampton
        (or)
      • RDF Document: http://dbpedia.org/data/Southampton.rdf
  • data.southampton.ac.uk
    • Identity: http://id.southampton.ac.uk/building/59
      (303 redirect to)

      • Document URI: http://data.southampton.ac.uk/building/59
        (302 redirect to actual format)

        • HTML Page: http://data.southampton.ac.uk/building/59.html
          (or)
        • RDF Page: http://data.southampton.ac.uk/building/59.rdf

Admittedly we could skip the document URI for speed, but this keeps things very clean logically. And it’s easy to tell at a glance what’s a URL and what’s a URI.
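For illustration, the chain is easy to watch from a few lines of client code. This sketch uses PHP’s curl without following redirects automatically, so each hop gets printed; whether the document URI picks .html or .rdf will depend on the Accept header sent, and the hop limit is arbitrary:

// Sketch: walk the 303/302 chain for a URI by hand, printing each hop.
// Each hop lands in a different server's log, which is part of why the
// stats are so awkward to stitch back together.
$uri = "http://id.southampton.ac.uk/building/59";
for( $hop = 0; $hop < 5 && $uri; $hop++ )
{
  $ch = curl_init( $uri );
  curl_setopt( $ch, CURLOPT_NOBODY, true );           // a HEAD request is enough
  curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, false );  // we want to see every step
  curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
  curl_setopt( $ch, CURLOPT_HTTPHEADER, array( "Accept: application/rdf+xml" ) );
  curl_exec( $ch );
  $code = curl_getinfo( $ch, CURLINFO_HTTP_CODE );
  $next = curl_getinfo( $ch, CURLINFO_REDIRECT_URL );
  curl_close( $ch );
  print "$code $uri\n";
  $uri = $next;
}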

The problem is that I don’t know a nice tool which could study the Apache logs from id.southampton.ac.uk and data.southampton.ac.uk and make some nice graphs. Interpretation would need to be very different from the assumptions you make about a human browsing web pages, and you don’t get handy things like “referrer” in the headers when a cheap semantic tool is poking your URIs.

Quantity vs quality

The other problem of equipment.data.ac.uk is that all hits, and all searches, are not created equal. Our goal is to improve the value the UK gets out of the money it has spent on research equipment, either by enabling fuller use of items, avoiding buying un-needed kit (oh, turns out there’s already one in the building next to ours!), or by encouraging new collaborations.

The problem is that these “wins” will come days, or months, after the visit to equipment.data. All the obvious methods for capturing feedback would make the site less user-friendly: collecting their contact data to email them follow-up surveys, etc.

The thing is, if once a year, a single search saves the country buying a single £500K item of equipment, it’ll have been a resounding success and paid for itself many times over. However, I have no clue how to measure that so we can report it back to the funders.

Uncertainty

In marketing, you talk about “conversions”: how many visits to the website actually resulted in a sale. The problem is that we don’t have a way to know when we’ve had a success, and with people caching the open data we might not even know when and how people are using our data!

I guess the frustration I have is that to properly measure these services, we would in some way have to compromise something that makes them good.

Are there solutions? Can web science help? Should we compromise linked+open data principles in order to be able to better quantify our success?

 

Posted in Best Practice, RDF, web management.



Taking my Leave

I’m not very good at taking holidays. Every year my line manager has to remind me to take some leave. The problem is that I actually enjoy my job. My friends now all have jobs or families, so there aren’t people I can just casually hang out with if I take some days off. I don’t have kids, so don’t have that kind of need for structured and planned holidays.

I’ve found some ways of actually taking and enjoying leave, and they may be of interest to others.

The key issue is my martyr-like belief that everything will go horribly wrong if I don’t pay attention to what’s going on at work.

That is, sadly, quite true. I’ve been away from the office for about 2 weeks (counting the Repository Fringe unconference). In that time, I’ve had a few urgent work things:

  • To review the final version of a job description we need to get out urgently
  • The place my immediate team were relocating to got changed to somewhere less desirable. If I’d not intercepted that we might have had 33% less floor space than initially agreed, for the next year or two.
  • An enquiry about requirements for another possible position, which would directly relate to my area of responsibility. (but as sorting out stuff for jobs takes ages, that’s less worrying)
  • Being asked to sign off on a description of a training event I’m running in October (asked at 5pm to sign off on it for next day).
  • Uploading data into a new system so people can start working with it ASAP.

Most of these can be achieved via half an hour’s attention, and I feel much more relaxed for knowing they are in hand rather than fretting. Some of these really did require my urgent attention, and if I’d not answered the emails I would be the one who ends up in the worse situation.

These new positions are a result of our work being seen as a success by various powers, which is great, but some care and attention is needed to ensure they actually achieve the intent behind creating them.

The trade off is that I’m not often in the office at 9am on work days.

Email Triage

I’ve been getting more and more organised about how I handle emails. On average I get maybe 80 a day now, which is much lower than the 150 of a few years back.

My first strategy has been to unsubscribe from everything I don’t really care about. I actually *do* want emails for “you have a direct message on twitter”, and if you message me on facebook it emails me so I don’t have to monitor my facebook inbox. However I’ve lots of random twitter accounts and these I have told to stop emailing me updates. Anything with an “unsubscribe” link gets clicked before I delete it. God damn you and your call-for-papers.

The key change has been to move to a two-tier inbox system. I now use my INBOX only for actually incoming messages, and they are either deleted, moved-to-archive or moved-to-TODO. Even if it’s just a case of “I need to read that but it’ll require my full attention” it goes into TODO. About 95% of messages get deleted or read and moved to archive.

The problem with this is that it’s actually still quite a pain to move a message to a folder. When Thunderbird introduced a much better search tool I gave up using lots of folders for my email archive and just use the “a” key, which moves the email to archive/2013 (the current year). This still left me with the faff of dragging stuff into the TODO folder, and so I looked into how to speed this up.

Dial “T” for “TODO”

I spent a bit of time looking into custom keys for Thunderbird and found “keyconfig” which does exactly what I want. It runs some javascript when you press a key (or key combo). In this case I’ve mapped “T” to:

MsgMoveMessage(GetMsgFolderFromUri("imap://cjg@imap.ecs.soton.ac.uk/_TODO_"))

(My TODO folder is called _TODO_ so it sorts first, alphabetically).

This means I can now triage my inbox 3 or 4 times a day in a very short amount of time, even when on leave. It means I know I’ve got a bunch of stuff in my todo list, but none of it urgent. Sometimes I leave genuinely urgent stuff in the INBOX, but it’s only one or two items so I don’t have that feeling of amorphous dread that I used to get from knowing some of those 300+ messages were important.

Vacation Message

I’ve noticed that after a couple of weeks away from the office, my vacation auto-messages are beginning to have an impact. People know I’m away and are emailing me much less. I’m seriously considering actually putting an “I’m going on leave” note into my .sig file a week beforehand next time I take a long leave.

Delegation

I don’t like the idea of leaving people hanging. Some of the emails I get are about serious problems and if these are only addressed to me (not cc’d to someone else), I’ll reply to say I’m on leave and cc someone else on the team who might be able to deal with it. This way I don’t have that nagging sense of guilt that my taking leave is causing problems. (yeah, martyr complex yadayada)

Ventnor Fringe V.I.G.

This is the key part of how I’m actually keeping from worrying too much about what’s happening back in the office. I’ve volunteered to help at the Fringe festival in Ventnor, the Isle of Wight town where I grew up. Much of this involves some similar work to what I used to do for Dev8D. You’ll notice the programmes look quite similar: dev8d, vfringe. The vfringe one is a much more developed version of the tools, but it’s also been heavily hacked.

At my suggestion, Joe (the VFringe Webmaster) used Drupal for this year’s site. He’s done some great work configuring it and is using it as a database of events and a way to manage memberships & volunteers. Also, because it’s a full CMS, most of the core team can log in and update details.

Being an alleged “web expert”, I’ve done my best to be helpful without treading on the toes of the young people organising the event. I’m here to help, not take over. This year I’ve come a whole week before the festival starts (staying with friends to keep the costs down) and that has worked much better than trying to do stuff remotely. My concerns about whether or not they really even wanted my help were allayed when the first thing they did when I showed up was to give me a full administrator account!

In previous years, we’ve used a Google Docs spreadsheet to collect the data for themes, locations and events. This year we’re pulling the list of events directly from the Drupal site. (I’ve made a custom view containing all the data I need, then I munge that into usable data.) There are a few issues. The key one is that while an event can have multiple start times, the node type allows only one location. Also, Joe used a free-text tagging system for the start times, which has produced a variety of horrors: a mix of 4pm, 4:00pm and 4.00pm. Most of them could be read with a well-tuned regular expression, and the last handful I fixed by hand. If we were doing it over, what we’d really like is a list of performances per activity, where each one has a distinct start date+time, end date+time, venue and prices. Chances are someone has already thought of this; if not, it’s a handy plugin to make.
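For the curious, the “well-tuned regular expression” was nothing clever. Something along these lines copes with most of the variants (this is an illustrative sketch, not the exact code I used):

// Illustrative sketch: normalise free-text start times like "4pm", "4:00pm"
// and "4.00pm" into 24-hour "HH:MM"; anything unparseable gets fixed by hand.
function normalise_time( $text )
{
  if( !preg_match( '/(\d{1,2})(?:[:.](\d{2}))?\s*(am|pm)?/i', trim( $text ), $m ) )
  {
    return null;
  }
  $hour = (int)$m[1];
  $minute = isset( $m[2] ) ? (int)$m[2] : 0;
  $ampm = isset( $m[3] ) ? strtolower( $m[3] ) : "";
  if( $ampm == "pm" && $hour < 12 ) { $hour += 12; }
  if( $ampm == "am" && $hour == 12 ) { $hour = 0; }
  return sprintf( "%02d:%02d", $hour, $minute );
}

print normalise_time( "4.00pm" ); // 16:00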

I spent a couple of days just refining my fancy views of the programme, like the day planner, an all-in-one-page programme designed to be saved on your mobile device in case of connectivity issues, and a “what’s on now/next” animation which appears on the homepage. All my bits and bobs use RDF & Graphite as a backend, but that’s just my usual toolbox; nobody here will be consuming this open data except me.

I spent the best part of a day using a pub wifi and keying all the little events and bar opening times etc. into the site, so that they all appear in the programme when people are planning their day. This was done while drinking some nice, but not-too-strong, beer, and with a view over the beach to the Channel. This doesn’t sound like everyone’s ideal holiday, but it gives me something tangibly useful to do outside of work and that helps me actually relax. It’s not very difficult, I’m good at it, and every hour I spend doing the grunt web-work is an hour I’m freeing up for one of the organisers to, well, organise.

I also spent half a day helping decorate one of the festival bars. A bit of honest and useful exercise right on the seafront is also a refreshing and satisfying way to spend my time.

I also ended up handling an issue of video encoding for them. They made a 2 minute video to be screened on the Red Funnel ferries, so that people on their way from Southampton to The Island will be made aware of the Fringe Festival. It was a bit of a nightmare, and I’m glad it was me dealing with it. The correct encoding, should you ever need to make an advert for the ferries, is an mpeg-encoded .avi file.

This is not a technology-focused festival. It’s about poetry, art, music and drama. Hence most of the organisers are very different from the people I’m used to at library, data and programming events. But it also means that my specialisms go much further and I can save them work, and do things they just plain couldn’t do otherwise.

If I could change one thing, it would be to buy them all little policeman-style notebooks, with general notes and a note page for things they each need to do morning/afternoon/evening each day leading up to and during the festival. They rely too much on their memories, and this makes them stressed: their memory of all the stuff they have to do is a bit like my relationship with my INBOX a couple of years ago. There’s tons to remember, and a few really important things, but it turns into a stress-inducing blob in memory. They are not great at keeping people informed of what’s going on, and that’s a hard skill to master. A key thing is getting back to people when you said you would, even if there’s no news. That way people know they are in the loop. Otherwise they fret as they assume they’ve been forgotten. It’s a tricky relationship, as there are over 300 people involved in various capacities, and for each group their performance is absolutely the most important single thing to them, but only one of many for the festival organisers.

Did I mention that the oldest organiser of this 4 day, 21 venue festival is only about 22 years old? The work they are doing is fantastic and I’m proud to be able to help.

Today I’ve done most of my helping, so I’m off to the Woodland Bar they’ve built in the street where I used to live, and will finish off my comic, Death & Cigarettes. It’s the last story in a comic which has run for 25 years. I’m not anticipating a happy ending for John Constantine. This evening I’ve got tickets for a magic and burlesque show, and after that there’s a lock-in at Ventnor library with spoken word performances compered by Dizraeli of Dizraeli and The Small Gods fame. I worked my way through that library’s science fiction and computer programming books as a teenager. You kids with Wikipedia don’t know how damn lucky you are!

Volunteer

To get back on topic, I think finding something useful to do with your leave is really effective as a way to stop worrying about work. I know at least three techies who have worked on shows at the Edinburgh Fringe. I think there’s real value and satisfaction in us devs using our skills in volunteering & charity work, rather than just donating a monthly amount or doing unskilled labour for them.

So I’ve actually had some genuinely refreshing leave. I’ve learned a bit about Drupal, and haven’t just spent my whole time playing Minecraft or in the pub.

Now I’ve just got to find a way to use up another 14 days’ leave before they expire on October 1st…

Posted in Conference Website, Drupal, Events, Graphite.



Retrieve by Path feature in the works for Graphite v2.0

I’ve been working on a rather funky new feature for Graphite v2.0. It’s a logical extension of the existing Resource Description feature, but it won’t replace it: the Resource Description feature lets you describe very simple paths out from a resource in a SPARQL endpoint, which gives a clear tree structure suitable for turning into JSON, so that’s still kinda useful.

This new function uses a modified version of the SPARQL 1.1 path syntax. I’ve added a couple of features and cheated a bit, but it’s damn useful.

foaf:name or <http://xmlns.com/foaf/0.1/name> matches foaf:name triples having the topic resource as the subject.

^foaf:member matches foaf:member triples having the topic resource as the object.

foaf:member/foaf:name matches all the triples having the topic as the subject of a foaf:member triple AND where the object matches the subject of a foaf:name triple (and gets that triple too). If you wanted all foaf:member triples regardless of whether there’s a foaf:name, you can use foaf:member|foaf:member/foaf:name

You can use ! to negate a set of predicates, i.e. !(foaf:name|foaf:mbox) gives all triples with the topic as the subject EXCEPT foaf:name or foaf:mbox triples.

You can use regexp-style +, ?, * and {n,m} to indicate that a path should be repeated. However, I’ve cheated a bit here, so it’s not unlimited: by default a path repeats up to 8 times, so + is treated as {1,8}. It works, and you can change the maximum depth.

One addition I’ve made is “.” to match any predicate, so a path of “.” matches every triple with the topic as the subject, whatever the predicate.

The call will be something like this:

$graph = new Graphite();
$graph->ns( "sr", "http://data.ordnancesurvey.co.uk/ontology/spatialrelations/" );
$thing = $graph->resource( "http://id.southampton.ac.uk/building/32" );
$endpoint = "http://sparql.data.southampton.ac.uk/";
$n = $thing->loadSPARQLPath( $endpoint, ".|(./(rdfs:label|a))|(^sr:within)+/(rdfs:label|a)?" );

$n is set to the number of triples returned.

What this actually returns is:

all the triples with <…/building/32> as the subject, and their labels and types, plus all the things within the building (or within things within the building, and so on) and their labels and types, if any.

OK, the above query is a bit much and takes about 5 seconds to return. Turning the + on ^sr:within into a {1,3} cuts it to a more reasonable 0.8 seconds.
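For reference, the depth-limited version of that call (the same set-up as above, with the + swapped for {1,3}) would look something like this:

$n = $thing->loadSPARQLPath( $endpoint, ".|(./(rdfs:label|a))|(^sr:within){1,3}/(rdfs:label|a)?" );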

This is just the first cut and I’m posting it here for suggestions. It’s not yet ready for integration into the core library but if you like it, have suggestions or other requests for Graphite v2.0 please leave a comment.

If you would like to help out, we could really do with someone running some kind of user community, but I just don’t have the time.

Update: Try a live demo!

Posted in Graphite.


Research Data Onion

We’ve been thinking a lot about research data and how to manage it, how to open it, how to share it, and how to get more value from it without making too much extra work. For some time I’ve been considering two different ways to think about research data.

The Onion Diagram

The first is this diagram, which shows the various layers of metadata which I see as surrounding a research dataset.

[Diagram: Research Data Onion]

Some of these layers are less obvious than others. The important thing is that each layer is created at a different time, by a different process, has a different purpose, and has different people responsible for it.

Often these layers are merged into a single database record, but it’s useful to think about them as distinct layers when looking at how to manage them.

As I’m most familiar with EPrints, I’ve included examples of where this information would be handled in that software if it was being used as a repository + catalogue for research datasets.

1. Research Dataset

This is the actual dataset produced as part of a research activity. It may be tiny or huge. It may or may not be available from a URL. In rare cases it may not be digital (a hand written log book of results). It might be the weird file format that the lasercryomagnoscopeatron produces, an excel spreadsheet or a hundred XML files. It may make sense to only a few people or tools.

If you are using EPrints as a dataset store+catalogue, this would be a document attached to an EPrints record.

2. Subject-specific Metadata

This will be provided by the researcher, and research communities will need to decide what goes in this metadata. Librarians can certainly advise and assist, but the buck stops with the researchers. This layer provides the research context for the dataset; it may include information about the processes used, and the type and configuration of equipment. Long term, I expect equipment manufacturers to be able to create much of this and output it with the raw dataset, similar to how modern digital cameras embed EXIF data in the JPEG images they create.

This might be as simple as a text description of anything you might need to know before working with the data, such as assumptions made, or the sample size etc, but I suspect we’ll see some fields start to standardise what metadata should be provided for certain types of experiment.

In a subject-specific archive this metadata may be merged with the other layers of metadata, but in an institutional repository there will be all kinds of weird and wacky datasets, so it’s important that the people running the data catalogue are not prescriptive about this metadata, although a subject-specific harvester may make some rules about what it should contain.

If you are using EPrints, this data would be stored in a supplementary document attached to the record. A few years back we added a “metadata” document format for exactly this purpose.

I would expect that, in time, subject-specific tools would harvest this data from multiple sources and provide subject-specific search and analysis facilities which would be beyond the scope of the university repository, but easy to implement on a big pile of similar scientific metadata records from many institutions. For example, a chemical research metadata aggregator could add a search by element (gold, lead…) which would be beyond the scope of the front end of the archive where the dataset is held.

3. Bibliographic Metadata

Here we get to most people’s comfort zone. This is the realm of good ol’ Dublin Core. This describes the non-scientific context of the dataset: who created it, what parts of what organisations were involved, when, where and who owns it and what the license is.

With my “equipment data” hat on, this seems like the layer which associates the dataset with the physical bit of equipment (e.g. http://id.southampton.ac.uk/equipment/E0007), the facility, the research group, the funder. Stuff like that. Things which the library and management care about, but which don’t really matter to the science, unless you are evaluating how much confidence you have in the researchers.

In EPrints this is the metadata which is configurable by the site administrator and entered by the depositor or editor.

4. Data Catalogue Record Metadata

This is any data about the database record. Most of this will be collected automatically. It’s often in the same database table as the bibliographic data but it’s not quite the same thing.

This layer of the onion is stuff like who created the database record, when, what versions there have been. This can generally be created automatically by the repository/catalogue software.

In EPrints this is the fields which the system creates automatically.

This is generally merged with the bibliographic data layer unless you are doing some serious version control, but it is a distinct layer of metadata.

5. Catalogue Metadata

These last two layers are not really considered most of the time, but if we want things to be discoverable and verifiable it’s helpful to be able to quantify them.

This is the layer of metadata about the data catalogue itself. Not all data catalogues actually contain the dataset; they may have got the record from another catalogue.

Anyhow, this layer tells you about what the catalogue contains, broadly, and the policies and ownership of the catalogue itself.

In EPrints this would be the repository configuration, such as the contact email and repository name, plus the fields which describe policy and licenses, which many people don’t ever bother to fill in. You can see this data via the OAI-PMH Identify method.
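As a rough illustration (the hostname here is made up, and /cgi/oai2 is just the usual EPrints default, so check your own repository), you can fetch this catalogue-level metadata in one line of PHP:

// hedged sketch: grab the Identify response from a hypothetical EPrints repository
$xml = file_get_contents( "http://eprints.example.org/cgi/oai2?verb=Identify" );
echo $xml; // an XML document with the repository name, contact email and policy descriptions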

6. Organisation Metadata

This is something which nobody has given that much thought to yet, but for data.ac.uk we’ve proposed that UK universities should create a simple RDF document describing their organisation, with links to key datasets, such as a research dataset catalogue, and other datasets which may be useful to discover automatically. This allows the repository to be marked as having an official relationship to the organisation. Some more information is available from the equipment data FAQ.

Peeling the Onion

The last step is to make the Organisation Profile Document (layer 6) auto-discoverable, given the organisation homepage. This means you can verify that an individual dataset is actually in a record, in a repository, which is formally recognised by the organisation (as opposed to one set up by a stray 3rd year doing a project, or a service with test data etc.). Creating and curating these layers provides auto-discovery and probity in a very straightforward manner.

Posted in Research Data.


FatFree, RedBean and FloraForm – A light and flexible web framework

In December the Web Team spent some time playing with web frameworks. My previous framework experience is with Django, which I highly recommend, but that was not appropriate here because Python is not one of iSolutions’ supported web languages. As a result we spent some time researching PHP frameworks. PHP is a bit of a hodge-podge and PHP frameworks are no exception; there are a lot of different options.

  • Zend – Zend Framework seems to be everyone’s go-to framework. I found it quite large and actually difficult to get set up and running, which made it a non-starter. We do agile here: if it takes more than 5 minutes to get started then you’ve missed the boat. It is very full-featured and has a big user base, but seems to be aimed at “enterprise”, which is a word I usually replace with “over-complicated”.
  • Cake – A bit easier to get started with than Zend and still fairly comprehensive. The Object Relational Mapper felt a bit backwards, but it is popular and has good community support.
  • FuelPHP – I invested a fair bit of time into this. Very cool, lots of nice features, good ORM. A bit complicated, and the documentation and user community were a little new. I complained about the documentation being a bit lacking in places and they fixed it, but I still wasn’t confident enough to choose it as a solution.
  • FatFree – Really pleased with this and chose it. The reasons are discussed below.

So why FatFree? Zero to writing code in less than 5 minutes. There are very few files and you can throw away the bits you don’t want to use. The documentation is good and Googling for problems gets solutions. Now for the real reasons. The other frameworks I looked at all require you to work inside them. We have a huge web presence which has grown over 15 years rather than been designed. FatFree let me take my old PHP and pepper it with FatFree. Over time more of the code we write will be converted to FatFree, but we didn’t have to do a huge big-bang move. Being able to gradually improve our existing stuff was important. Also, FatFree is not trying to do EVERYTHING, and as a result it is built to work with code which was not really designed to integrate with FatFree. The best example is the ORM. FatFree provides a very basic Object Relational Mapper (it takes PHP objects and stores them in a database). This would be a weakness if it was harder to integrate other libraries into FatFree. Enter RedBean PHP.
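To give a feel for the “zero to writing code” claim, here’s a minimal sketch of what day one can look like. It assumes the 2.x-era static F3:: API we were using at the time, and the route itself is made up:

require_once( "base.php" ); // FatFree is essentially this one file; adjust the path to wherever you unpacked it
F3::route( "GET /hello", function() {
    echo "Hello from FatFree"; // your existing PHP can carry on running alongside routes like this
} );
F3::run(); // dispatch the current request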

RedBean is the best object relational mapper I have ever used, absolutely no question about it. When prototyping an app in FuelPHP I had to know exactly what I wanted the database to look like at the start. RedBean lets you completely change that around exactly as you see fit while you program, and it just works out how to store it all in the database. There are a few nuances (aliasing had me completely stumped for about 20 minutes), but it’s easy once you’ve cracked it. The one thing missing was a slick way to take user input, but FatFree’s flexible design enabled us to use the FloraForm library Chris had written.
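Here’s roughly what that schema-free workflow looks like, as a minimal sketch using RedBean’s usual R:: facade; the “event” bean and its properties are made up for illustration:

require_once( "rb.php" ); // the single-file RedBean distribution
R::setup( "sqlite:/tmp/demo.db" ); // any PDO DSN will do
$event = R::dispense( "event" ); // no schema defined anywhere up front
$event->title = "Open Mic Night";
$event->venue = "Woodland Bar";
$id = R::store( $event ); // RedBean creates or alters the table and columns to fit
$copy = R::load( "event", $id );
echo $copy->title;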

FloraForm lets you easily construct a form, parse input and customize validation. There is still a bit of work required to make this really reusable, but it has become part of our core tool set, so expect work in this area. The thing which made FloraForm the ideal addition to this little toolkit is that it returns all of your form inputs in a big PHP hash. From this hash, a 10-line function serializes the data into RedBean objects, and a similar 10-line function de-serializes it back into the form. The result is that constructing a FloraForm interface builds your database tables and stores your data. This is a very fast and powerful combo. For simple systems you will need to do no further work, and you can prototype complicated systems very fast, making radical design overhauls with very little effort. This is ideal for the ever-changing goal posts of real-world development with limited staff time.
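I can’t share FloraForm itself here, but the serializing trick is simple enough to sketch. Assume $values is the big hash of form inputs; the function name and field names below are hypothetical:

function hashToBean( $type, array $values ) {
    $bean = R::dispense( $type );
    foreach( $values as $key => $value ) {
        // scalars map straight onto bean properties; nested parts of the hash
        // are flattened to JSON here just to keep the sketch short
        $bean->$key = is_array( $value ) ? json_encode( $value ) : $value;
    }
    return R::store( $bean ); // returns the id of the stored bean
}

$id = hashToBean( "booking", array( "name" => "Alice", "venue" => "Woodland Bar" ) );

De-serializing back into the form is the same idea in reverse.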

One final note about FatFree worth mentioning is that it allowed members of the team who had not done framework development before to gently transition into frameworks. This may not sound significant, but in a busy working environment having to completely overhaul your working practices in one go is very painful and time-consuming. On day one of FatFree you can just use the router and do everything else as normal. After that, maybe you experiment with templates. Next time you build a database, have a play with RedBean. Before you know where you are, you’re a full-blown framework developer, without the upheaval of having to learn to do your job from scratch.

Posted in Best Practice, Open Source, PHP, Templates, Uncategorized, web management.
