I was recently sent to attend the Scholarly Infrastructure Technical Summit (SITS) by JISC, along with Ian Mulvany from Mendeley.
The goal of each SITS meeting (as I see it) is to get technical experts (a mix of developers and project managers who understand tech) together to talk about their experiences/problems/successes with various scholarly infrastructure tools or components.
Something which worked extremely well (which was new to me) was having the meeting run according to an Open Agenda. Topics of interest were brought up by participants, and then voted upon to ensure that there was sufficient interest in a specific subject area.
The meeting started with a run-through of topics that were raised at the previous SITS. These included:
SWORD, reverse SWORD, common tooling for workflows, storage abstraction layers for repositories, author identification, and appropriate citations for digital objects.
The topics which were raised and eventually discussed this time around were:
- Authentication/Authorisation
- RDF/Linked Data
- Web Archiving
- People/Author Identifiers
- Microservices
- Curriculum/Training Development
- Search Engine Optimisation (SEO)
- Lightweight Languages
Authentication/Authorisation
Discussions here were focused around authorisation and authentication, both across services within an institution, and between institutions.
Shibboleth was (as expected) the primary technology talked about, followed by OAuth, and then OpenID.
I was hoping to hear of some success stories here, but mostly it was the problems and questions about these systems which came through:
- How can we combine Shib with IP or key based auth?
- How can we provision temporary/guest accounts in such a system?
- How can we trust remote credentials?
- How can services authenticate between each other?
The issues with service-to-service auth mostly came down to the extra client libraries that would need to be used (especially for web services, where basic access authentication remains the easiest method to use).
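To illustrate why basic auth keeps winning for service-to-service calls, here's a minimal sketch in Python (the endpoint URL and credentials are placeholders I've made up): one HTTP call, and nothing extra needed beyond an HTTP client.

```python
# Minimal sketch of service-to-service auth using HTTP basic access
# authentication. The endpoint and credentials are made-up placeholders.
import requests

SERVICE_URL = "https://repository.example.org/api/items"

def fetch_items():
    # requests builds the "Authorization: Basic ..." header for us
    resp = requests.get(SERVICE_URL,
                        auth=("service-account", "shared-secret"),
                        timeout=10)
    resp.raise_for_status()
    return resp.json()
```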
Another major issue was that of management of access to resources. If centrally managed, how can a data/service provider be confident that their institution is keeping their groups/users/access levels up to date and correct? And more importantly, how can we be confident that an external organisation who has access to our systems is doing the same?
Related to this issue is that of spreading access control of data around (Chris Gutteridge also has a related blog post on this in the context of linked data). Most major auth systems seem to centralise control – but what happens when an individual wishes to share some of their personal data with another person/group/service? What happens if a department wishes to have their own policy controls? What models are there for delegating control, whilst ensuring overall stability and security of a system as a whole?
I think the most interesting outcome from this topic was discussion about a shim or meta-auth layer that could sit behind several auth systems. There certainly seemed to be a lot of interest in something which could authenticate against Shib/OAuth/OpenID/anything else, and then provide a single set of auth details to an institutional system.
It would mean that institutional software could interface with this one layer, and have additional auth mechanisms added to it through extensions/plugins, rather than having to plug multiple auth systems directly in, and have to update code each time a new auth system comes along.
Mendeley do this internally, but if there’s an open source solution for this out there we don’t know about, I’m sure it would be very popular…
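To make the idea a bit more concrete, here's a rough sketch (in Python, with entirely hypothetical class names) of the shape such a layer might take: institutional code only ever calls the shim, and each auth system is wrapped as a plugin behind a common interface.

```python
# Sketch of a pluggable meta-auth layer: all names here are hypothetical.
# Each backend adapts one auth system to a common interface; institutional
# software only ever talks to AuthShim.

class AuthBackend:
    """Common interface each auth plugin implements."""
    def authenticate(self, request):
        """Return a dict of user details on success, or None."""
        raise NotImplementedError

class ShibbolethBackend(AuthBackend):
    def authenticate(self, request):
        # e.g. read attributes the Shibboleth SP has put into the environment
        eppn = request.environ.get("eppn")
        return {"id": eppn, "source": "shibboleth"} if eppn else None

class OAuthBackend(AuthBackend):
    def authenticate(self, request):
        # e.g. validate a bearer token with the provider (validation omitted)
        token = request.headers.get("Authorization")
        return {"id": "user-from-token", "source": "oauth"} if token else None

class AuthShim:
    """Tries each registered backend in turn and normalises the result."""
    def __init__(self, backends):
        self.backends = backends

    def authenticate(self, request):
        for backend in self.backends:
            details = backend.authenticate(request)
            if details:
                return details
        return None

# New auth systems are added by registering another backend, rather than by
# changing the institutional code that calls shim.authenticate(request).
shim = AuthShim([ShibbolethBackend(), OAuthBackend()])
```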
RDF/Linked Data
The discussions around linked data were very similar to several I’d had with people over here in the UK – the barriers to adoption were the same at least:
- Which vocabularies should we be using?
- Should we be creating our own?
- Where can we see examples of best practice?
- Whose identifiers should we use?
- What content should we be making available first?
Seems it would be good to include some US institutions in the discussions that are happening around the UK academic community at the moment, at the very least to prevent us from going off in entirely different directions!
Lack of tooling was also perceived as a significant barrier to entry. There was a lot of interest in access to linked data using RESTful APIs (e.g. the linked-data-api), and using javascript and JSON to consume the data. These allow experienced developers to consume RDF using methods/technologies they’re already familiar with.
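As a rough illustration of the appeal (in Python rather than javascript, and against a made-up endpoint): the data can be treated as plain JSON over HTTP, with no RDF libraries in sight. The exact URL and response layout will vary between linked-data-api deployments.

```python
# Sketch: consuming a linked-data-api style endpoint as plain JSON.
# The URL is made up, and the exact response layout ("result", "items",
# "_about") will depend on the deployment.
import requests

resp = requests.get(
    "https://data.example.ac.uk/doc/people.json",  # hypothetical endpoint
    params={"_page": 0},
    headers={"Accept": "application/json"},
    timeout=10,
)
resp.raise_for_status()
data = resp.json()

# No SPARQL or RDF libraries needed - just walk the JSON structure.
for item in data.get("result", {}).get("items", []):
    print(item.get("label"), item.get("_about"))
```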
During this discussion (and during a couple of others), quite a few people expressed dissatisfaction with Dublin Core as a means of describing repository data. It seems that some (including people at Google) were interested in looking at HighWire as an alternative. I know next to nothing about this though (a search doesn’t reveal much either…), but will update should I find out more.
Web Archiving
Next up was the topic of archiving web resources, which followed on nicely from a presentation on recent developments in Memento at the DLF Fall Forum.
A key point here was the issue of when/what to archive. Some web resources (a paper in a repository for example) have fairly well defined versions – but do we want to archive every single one?
Other resources (a page aggregating 3rd party content for example) won’t have such well defined versions, and will need to either be archived periodically, or by constantly watching them for changes.
Assuming we’ve taken care of the above, the next point of discussion was about searching. What sort of interface will be needed to search historical resources? Obviously a user won’t want to be presented with a dozen search results containing almost identical content. They’re also likely to want to browse back and forth through important versions of an item once they’ve found a document they’re after.
Someone also raised the point of the difference between browsing/searching historic documents vs. browsing/searching as if you were on a historical system. Some users will want to search using historical indexes, others will just want historical results.
There was certainly interest in Memento as an easy to implement strategy for archiving some web content now (adding it to repositories would be an easy win in this regard), and then worrying about what to do with the data later. It also seems to have the advantage of working below the normal web application level, meaning that the same technology can be used for archiving video/images/RDF/html, without requiring an application specific setup each time.
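To give a flavour of why it sits below the application level: a Memento client just performs ordinary HTTP datetime negotiation against a TimeGate, so the same request pattern works whether the resource is HTML, RDF or an image. A rough sketch (the URLs are placeholders):

```python
# Sketch of Memento-style datetime negotiation: ask a TimeGate for the
# version of a resource closest to a given date. The URLs are placeholders.
import requests

TIMEGATE = ("https://archive.example.org/timegate/"
            "http://repository.example.org/item/123")

resp = requests.get(
    TIMEGATE,
    headers={"Accept-Datetime": "Thu, 01 Apr 2010 00:00:00 GMT"},
    timeout=10,
)

# We should end up (after redirects) at a Memento of the resource; its
# Memento-Datetime header says when that version was captured.
print(resp.url)
print(resp.headers.get("Memento-Datetime"))
```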
It was also mentioned that JISC would be commissioning some large scale work on the preservation of fast moving resources.
People/Author Identifiers
The focus of this topic touched on a lot of things, but mostly revolved around ORCID (Open Researcher & Contributor ID).
The ORCID initiative aims to provide a registry of authors/contributors (to aid in communication, author disambiguation), which can then be linked to other ID schemes, to publications, or to each other.
ORCID is (I believe) a follow-on from Thomson Reuters’ ResearcherID. ResearcherID required self registration though, which is where it’s believed to have failed (they had <20000 individuals register). ORCID’s aim is to get author information from institutions, rather than individuals.
I was surprised to see there had been so much interest in this already: 300-400 institutions have registered their interest so far. It seems that some journals may start requiring ORCID IDs before publishing, which could well be a driver in this.
This, along with the fact that it should work nicely with other ID systems makes it look like something worth keeping an eye on.
It’s not without its issues and potential problems though.
The first of these is that the information kept by ORCID hasn’t been finalised yet. What should they store along with an author’s ID and name(s)? Publication list? Grants? If so, who’s going to be responsible for maintaining it?
This also raises the issue of control of personal data. If an institution makes a statement about you in ORCID, do you have the right to retract it? What about if an individual starts making statements an institution knows are untrue?
Storing the provenance of each fact about an individual in ORCID seemed to be the accepted solution for this – it would leave it up to the data consumer to trust individually/institutionally submitted facts about a person.
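Just to illustrate the idea (this isn’t ORCID’s actual data model, which was exactly what hadn’t been settled), per-assertion provenance might look something like this:

```python
# Illustration only: one way of attaching provenance to each assertion
# about a researcher, leaving trust decisions to the consumer. This is
# not ORCID's actual data model (which was still undecided at the time).
from dataclasses import dataclass
from datetime import date

@dataclass
class Assertion:
    predicate: str      # e.g. "affiliation", "publication"
    value: str
    asserted_by: str    # "self" or an institutional identifier
    asserted_on: date

profile = {
    "id": "researcher-0001",  # placeholder identifier
    "assertions": [
        Assertion("name", "J. Bloggs",
                  asserted_by="self", asserted_on=date(2010, 11, 1)),
        Assertion("affiliation", "University of Example",
                  asserted_by="example.ac.uk", asserted_on=date(2010, 11, 5)),
    ],
}

# A consumer could then choose to trust only institutionally-asserted facts:
trusted = [a for a in profile["assertions"] if a.asserted_by != "self"]
```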
By far the biggest obstacle seems to be the lack of an ongoing business model for ORCID though. Once it’s up and running, and as it’s a centralised service, how should it be funded? The identifiers it mints will need to be permanently resolvable in order for anyone to trust it and use it as a service, so how can the community guarantee this?
The project is still very much looking for contributors to develop things from the institutional side though, so it looks like there may be several potential projects here.
Microservices
This is a term I hadn’t heard before this month, but came up a lot at the DLF Forum, as well as being a topic with significant interest at the SITS meeting.
Luckily it wasn’t just me who was unsure of exactly what it meant – it seems to be a bit of a buzzword, taken to mean different things by different people. A rough summary can be found on the iRODS micro-services page. The CDL (California Digital Library) also have their own take on microservices on their Curation Micro-Services pages.
The basic gist of microservices seems to be this: Repository software is made up of collections of services. So why not separate them out, and make each one available for reuse, either as a web service (SOAP/REST), or on the command line?
This mirrors the Unix philosophy of making programs do one thing well, and building larger tools by combining several of them.
They allow for easily interchangeable components with a narrow focus, allowing for complex services to be built up without reinventing the wheel each time (especially when moving between languages or platforms).
Some examples of microservices might include:
- image resizing
- file checksum calculation
- file hashing service
- send email
- object storage
I’m not convinced the specs for these are well defined enough for general purpose use yet, but I can see the technique in general being very useful. If the same microservices can be called through multiple interfaces (command line, REST, etc.), then they should in theory be language agnostic.
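As a rough sketch of the pattern (names and endpoint entirely made up): a single-purpose checksum service, with the same function exposed on the command line and, optionally, over HTTP.

```python
# Sketch of a single-purpose "checksum" microservice. The same function is
# exposed on the command line, and could equally sit behind a REST endpoint.
# All names here are made up.
import hashlib
import sys

def checksum(data: bytes, algorithm: str = "sha256") -> str:
    """The one thing this service does: hash some bytes."""
    h = hashlib.new(algorithm)
    h.update(data)
    return h.hexdigest()

# 1. Command line usage:  cat thesis.pdf | python checksum.py md5
if __name__ == "__main__":
    algo = sys.argv[1] if len(sys.argv) > 1 else "sha256"
    print(checksum(sys.stdin.buffer.read(), algo))

# 2. The same function behind a (hypothetical) REST endpoint, e.g. with Flask:
#
#    @app.route("/checksum/<algorithm>", methods=["POST"])
#    def checksum_endpoint(algorithm):
#        return {"checksum": checksum(request.get_data(), algorithm)}
```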
Curriculum/Training Development
Things moved to a slightly less technical theme here, the focus being on how to get new staff/project members quickly up to speed in a development or project environment.
The key point was to work out how best to ensure that new team/project members gain the skills they need to get their work done.
This includes a mix of technical and non-technical skills, the exact nature of which will vary depending on the project:
- Source code management (git/svn) and committing guidelines
- Documentation guides (how to document code/software)
- Code style guides (what should my functions be named? should I indent with spaces or tabs?)
- Unit testing (which framework should I use? when should I write them?)
There are also a number of project/platform specific things an individual will need to know:
- How do I add plugins to this repository?
- How do I code X in language Y?
A key point raised here was in the difference between people who were primarily librarians and those who were primarily developers. How do we get each up to speed in the areas they lack? A single curriculum for everyone probably wouldn’t suffice.
Additionally, how should this be taught, and who by? Online notes? Or taught as part of a library/computer science course?
Making sure that the right people attend the right training courses/workshops was also a key point mentioned. Ensuring that an individual has the prerequisites necessary to participate is essential in order not to waste time and money.
Some suggestions about starting points for developers/managers looking into this included the following:
- code4lib wiki – a guide for the perplexed
- Producing Open Source Software – a free (e)book
- Making Things Happen – an O’Reilly book on project management
- Joel on Software – a blog on software development
Search Engine Optimisation
This topic should really have been called “Google Scholar Optimisation”, such is the perceived importance of Scholar in the library/repository world (Scholar is apparently way ahead of Web of Science as a student portal to research for example).
There was a great deal of dissatisfaction expressed with the way Scholar works, primarily concerning the fact that it’s the Scholar team who are dictating the metadata that institutions produce in order to be included in their index (more info in Google Scholar’s Inclusion Guidelines).
It was largely felt throughout the room that it should be the academic community who is responsible for agreeing upon a standard for exposing metadata (RDFa was mentioned here), rather than being forced to adopt a 3rd party’s schema which doesn’t fit their data very well.
One thing I learnt from this was that the Scholar team is separate from the regular Google search team, and the technologies they use to index/harvest differ. This means that institutions have to produce metadata multiple times: for Scholar, for regular Google search (and probably more for additional harvesters).
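For reference, my reading of the inclusion guidelines linked above is that Scholar wants a set of Highwire-style citation_* meta tags on each abstract page; a rough sketch of generating them from a (made-up) repository record is below, though the tag names should be checked against the guidelines themselves.

```python
# Rough sketch: emitting Highwire-style citation_* meta tags for one
# repository record. Check the tag names against Scholar's own inclusion
# guidelines before relying on them; the record structure here is made up.
record = {
    "title": "An Example Paper",
    "authors": ["J. Bloggs", "A. N. Other"],
    "date": "2010/11/10",
    "pdf_url": "https://repository.example.org/123/1/paper.pdf",
}

tags = [("citation_title", record["title"])]
tags += [("citation_author", author) for author in record["authors"]]
tags += [("citation_publication_date", record["date"]),
         ("citation_pdf_url", record["pdf_url"])]

for name, content in tags:
    print('<meta name="%s" content="%s" />' % (name, content))
```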
So what are the solutions/workarounds? Some ideas raised were to:
- Agree on RDF(a) standards for presenting metadata
- Contact NISO about developing a standard
- Approach a rival to Scholar (e.g. Microsoft Academic Search)
- Use the collective bargaining power of institutions to effect a change at Scholar
I’ll finish by saying that quite a lot of people in the room felt very strongly about this; it didn’t sound like Scholar was making anyone very happy!
Lightweight Languages
The main gist of this topic was discussion about barriers to introducing a new language into a project or environment.
Ruby was the main language being discussed, but the discussion could just as easily apply to any language/technology being considered for use.
The key word here was “misconceptions”, seemingly from all sides where introduction of a new technology is concerned.
One barrier to adoption was seen to come from developers themselves. Many are reluctant to learn a new language (especially one perceived as too hard/different to the ones they know). This could be especially relevant to non-CS developers – they’re more likely to have language specific experience, rather than abstract programming knowledge, making it harder to switch.
The next was from a sysadmin point of view. The introduction of new languages can be seen as a security risk, and as yet another set of software that needs patching/updating/configuring. Different languages also have very different security models that need looking at, PHP’s now deprecated safe mode and Java’s Security Technology are a couple of examples which are very different indeed.
There’s also a certain level of suspicion (speaking from my own experiences here too…) about whether new languages/technology are really required for a project. If a project manager has a good reason for picking a certain language (e.g. they have to use a specific software package, libraries/plugins in a certain language are good, a language is needed to easily interface with other tools, etc.), then that’s one thing. There are just as many less valid requests to use a language though (e.g. it’s all I know, it’s a current buzzword, etc.).
So how do we get around these issues?
Virtualisation or bundling of the language was one solution mooted. Using virtual machines is one example of how this could be done (though it still raises many questions about security and trust). The other example given was in bundling up the language and libraries in a single package that could run in a more sandboxed environment (the Ruby language bundled in a WAR file running on a JVM was given as a successful application of this technique).
More important than this though, was the idea of getting sysadmins involved early. Rather than going to them with a list of requirements, teams seemed to have much more success by involving them alongside developers from the start, getting them on board to discuss issues and solutions rather than dictating these at a later date.
Closing Thoughts
Overall, I thought the meeting was a big success, and opened my eyes to quite a few big things that I wasn’t aware of before (ORCID, microservices and Google Scholar issues being foremost among them).
More importantly, each topic mentioned above was finished with some action items, so I’m hoping that we hear some progress on these from various SITS attendees in the near future (I’ll add links to this post as I hear about them).
Future SITS meetings will take place in different locations, and with different attendees, so I’m hoping we’ll see a good cross-section of issues and experiences coming from the meetings. I’m sure we’ll see a lot of common threads come up that lead to less repetition of software development, and more importantly, less repetition of mistakes!