When the partners in the CREAM project came together in February 2015, during the first Jisc Sandpit, we realised that our collective experience of research processes included instances of the active use of metadata. However, we acknowledged that we needed first to learn how to recognise the factors that characterise active use within our own narratives, and then to investigate how our insights could be applied in other domains.
During Phase 1 we collated examples and experiences from all the partners and deployed Annalist with sets of representative data records. A key aspect of our methodology was to consider the feasibility of defining a core model and to investigate whether existing schemas or vocabularies could capture core elements. We anticipated that by examining and modelling workflows from a number of research domains we would be able to identify common kinds of data element that are used actively in a variety of situations. Unsurprisingly, we did notice some very provenance-like elements in some of the workflows we examined, but these were largely post hoc recordings of what happened rather than evidence of active use of metadata.
However, it became apparent that the way metadata is used actively is very much domain-specific, so during Phase 2 we adapted our methodology to characterise how we use metadata actively. We found that our active use of metadata was substantially implicit rather than explicit in the workflows we examined, and that we would need to delve more deeply to uncover the “active metadata” in use. This idea is explored a little further in the Carol and Martin dialogue.
The individual sections in this project report describe briefly the outcome of analysing the representative data records for each of the examples provided by the CREAM partners. Analysis with Annalist is still in progress for a couple of other examples, but the results are not yet available.
The Annalist sub-domain for CREAM is cream.annalist.net and will be used for all current and future documentation related to the deployment of Annalist for the CREAM project.
This is a small example based on the GeosMeta architecture. It is quite abstract in nature, because no concrete GeosMeta data was available at the time. Based on the information provided, the data was modelled substantially using PROV, with the addition of a status indicator associated with PROV activities.
The example shows a segment of a provenance trace from entity `7_TX` derived from entities `4_TX` and `5_TX` by an identified activity. Follow links from http://cream.annalist.net/annalist/c/GeosMeta_example/d/Entity/7_TX/
The data as provided don’t explicitly indicate active use of metadata, but discussions with the GeosMeta team suggest that the Activity status indicator (e.g. “Downstream of failure” in http://cream.annalist.net/annalist/c/GeosMeta_example/d/Activity/5538c92daf15b55bd33557b1/) would be used to decide on follow-up processing (or rejection) of results, so that could be characterized as active metadata.
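As a minimal sketch of what that kind of active use could look like, the following routes an activity's results according to its status indicator. Only the status value “Downstream of failure” comes from the example data; the `Activity` class, the `next_step` function, the “Completed” status and the returned action names are assumptions made for illustration, not part of GeosMeta.

```python
from dataclasses import dataclass


@dataclass
class Activity:
    """Hypothetical record shape: a PROV activity plus a status indicator."""
    activity_id: str
    status: str


def next_step(activity: Activity) -> str:
    """Choose follow-up handling of an activity's results from its status alone."""
    if activity.status == "Downstream of failure":
        # Status value taken from the example data: results depend on a failed step.
        return "reject"
    if activity.status == "Completed":  # assumed status value
        return "process"
    # Any status we do not recognise is passed to a person to decide.
    return "review"
```

The point of the sketch is that the status is read and acted upon, rather than merely recorded after the event.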
LabTrove Sortase Example
This example contains information extracted from LabTrove (https://blogs.chem.soton.ac.uk/sortase_cloning). The entries on which the examples are based were chosen because they appeared to make more effective use of LabTrove metadata than other entries. A range of example entries was chosen to cover the different kinds of metadata found in the LabTrove entries for this experiment.
We see in this data that the metadata provided is entirely about details of the samples, instruments, observations, etc. No process metadata is provided as explicit LabTrove annotations: the process description here is entirely implicit in the descriptive blog text, so it’s hard to discern any active use of the metadata provided. This suggests that the active metadata use may be tacit rather than explicit, something that is drawn out in our wider discussion of active metadata.
This example is based on an entry from the eCrystals database. We see here a lot of specific details about the sample and its analysis, but no process description or breakdown that allows us to discern active use of metadata. Again, this suggests that the active metadata use may be tacit rather than explicit.
The iterative/refinement part of the analysis we do is not captured here, partly through culture and partly through a lack of tools (in the past) to do so. Essentially, what is shown in the ORE representation is the beginning and end of the process. There are typically about 10 cycles around the .res and .ins files in the process, where parameters are tweaked or turned on. At each cycle a decision is made as to whether the tweaking has made the model better or worse. We have a pretty good idea of the metadata that would be recorded at each tweaking stage, although there is rather a grey boundary between what is a parameter and what is metadata (which is likely to be a common problem). The next generation of software, written after the capture of all the records in eCrystals, has incorporated the notion of a ‘history’, which records the workflow so that you can roll back during the process (based on some metadata indicators) to any fork or decision-making point.
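The ‘history’ idea can be sketched roughly as follows. This is not the design of the software mentioned above: the `RefinementHistory` class, its method names, the parameter names in the usage note, and the use of a single numeric quality indicator (lower meaning better) are all assumptions for illustration.

```python
class RefinementHistory:
    """Keeps every refinement cycle so an earlier decision point can be restored."""

    def __init__(self):
        # List of (parameters, quality) tuples, oldest cycle first.
        self._cycles = []

    def record(self, parameters, quality):
        """Store a copy of the parameters used in a cycle and the resulting quality."""
        self._cycles.append((dict(parameters), quality))

    def best_cycle(self):
        """Index of the cycle with the lowest (assumed best) quality indicator."""
        return min(range(len(self._cycles)), key=lambda i: self._cycles[i][1])

    def roll_back(self, index):
        """Discard all cycles after `index` and return that cycle's parameters."""
        del self._cycles[index + 1:]
        return dict(self._cycles[index][0])
```

For instance, after recording three cycles with quality values 0.12, 0.08 and 0.09, `best_cycle()` identifies the second cycle, and `roll_back(1)` returns to its parameters and drops the later tweak. Here the quality value is the metadata being used actively.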
This example shows a short section of an activity trace, modelled using the PROV ontology. The original data came from a spreadsheet, but there was a clear indication there that the rows were successive operations on a file, so mapping this to the PROV model was very straightforward. (URIs for the intermediate steps are based on the proposed duri: scheme, thus avoiding the problem that PROV does not itself describe changes applied to a single identified resource; the `prov:specializationOf` properties relate these back to the actual resource that is being changed.)
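The mapping described above might be sketched as follows, using plain tuples rather than an RDF library. The property names `prov:specializationOf`, `prov:wasGeneratedBy`, `prov:used` and `prov:wasDerivedFrom` are from the PROV ontology. The function name, the row layout, the `activity/` identifiers and the exact `duri:<timestamp>:<uri>` form are assumptions for illustration and may not match the records in the Annalist collection.

```python
def rows_to_prov(resource, rows):
    """Map ordered spreadsheet rows (timestamp, operation) to PROV-style triples.

    Each row becomes one activity that generates a new dated state of `resource`.
    """
    triples = []
    previous_state = None
    for timestamp, operation in rows:
        # Assumed form of a dated URI for the state of the file after this step.
        state = f"duri:{timestamp}:{resource}"
        activity = f"activity/{timestamp}"  # hypothetical activity identifier
        triples.append((state, "prov:specializationOf", resource))
        triples.append((state, "prov:wasGeneratedBy", activity))
        triples.append((activity, "rdfs:label", operation))
        if previous_state is not None:
            # Later steps consume the state produced by the step before.
            triples.append((activity, "prov:used", previous_state))
            triples.append((state, "prov:wasDerivedFrom", previous_state))
        previous_state = state
    return triples
```

Each intermediate state gets its own identifier, so the trace never has to say that a single resource both was used by and was generated by the same activity.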
We could not identify any explicit instances of active use in the metadata provided, but the Activities in this trace carry a “plan detail”, which is an XML representation of the InkScape command corresponding to the editing steps represented (e.g. see http://cream.annalist.net/annalist/c/artivity_example/d/Activity/20150602T143208/). The details of this representation are somewhat opaque, but it is reasonable to suppose that active use of metadata might be encoded here.
This example is drawn from “auto-reduction” processing of raw data from ISIS neutron source (high energy physics) experiments. It represents a small part of a much larger processing pipeline associated with these experiments. The explicit metadata here are parameters associated with auto-reduction activities, and some activity status information (e.g. http://cream.annalist.net/annalist/c/ISIS_Autoreduction/d/Job/Job_20580/).
While there is no explicit process breakdown here, each record does represent an auto-reduction processing activity, and the organisation of the data does carry implicit information about repetition of auto-reduction activities until a suitable result is obtained, which constitutes an active use of the job status. For example, see the sequence:
The brief descriptive comments suggest parameter and configuration changes applied until a successful run is obtained. Discussions with STFC indicate that this is a very simple example of processing decisions based on results obtained, and that further along the processing pipeline there will usually be decisions made based on a much wider range of values than just the job status value.
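The repeat-until-success pattern can be sketched as below. This is an illustration of the pattern only, not of the ISIS auto-reduction software: the `run_until_success` function, the `run_job` callable, the “Completed” status value and the parameter names in the usage note are all assumptions.

```python
def run_until_success(run_job, parameter_sets):
    """Try each parameter set in turn, stopping at the first successful job.

    `run_job` is a caller-supplied function that takes a parameter dict and
    returns a status string. Returns the list of (parameters, status) attempts.
    """
    attempts = []
    for parameters in parameter_sets:
        status = run_job(parameters)
        attempts.append((parameters, status))
        if status == "Completed":  # assumed status value for a successful run
            break
    return attempts
```

Called with a list of candidate configurations such as `[{"mask": "a"}, {"mask": "b"}]`, the function stops as soon as one run reports success, so the job status is what decides whether another attempt is made.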
Creation of two audio-visual pieces, Smoke and Fog
http://cream.annalist.net/annalist/c/IG_Philadelphia_Project/ and http://cream.annalist.net/annalist/c/IG_Fog/d/
This example includes a complete description of the process of creating two multimodal (image/sound/video) artworks by Iris Garrelfs as part of this project. The primary data is in the form of two journals consisting mainly of descriptive text and resource links. This is then overlaid with a description of the process of creation using a formalisation of the Procedural Blending model set out in Iris’ PhD thesis (http://irisgarrelfs.com/thesis); e.g. see http://cream.annalist.net/annalist/c/IG_Philadelphia_Project/l/BlendInputs/.
While there is not (yet) much in the way of formally identified metadata here that can be seen as actively used, the “Inputs” serve a similar role, and working with the procedural blending model has given rise to insights about where active use is occurring. In this model, the “Blend Nodes” are particular activities where key decisions are made about how the available inputs are combined, creating outputs that are passed on to subsequent steps of the overall process.
With regard to re-use, Fog was based on Smoke, and a blend diagram correlating the two projects shows where metadata has been re-used; see http://cream.annalist.net/annalist/c/IG_Fog/d/Image/Smoke_Fog_PB_Correlation.
This example is based on a mixture of personas developed to explore the experimental activities of a variety of researchers in chemistry and a possible set of records describing the researchers, organisations, projects, and the resources and procedures involved in carrying out and recording experiments. The experiment examples are based on real experiment records from a number of sources.
The example is far from complete, but it creates a variety of linked records using some initial data captured in Annalist. The result shows that Annalist can be used to create custom interfaces for recording experimental information, plans for experiments, and data about the resources and context of the experiment. An experiment record (for example, see http://cream.annalist.net/annalist/c/Chemistry_Personas/d/type_Experiment/experiment_Synth001/) includes a variety of metadata, including:
- Equipment and instruments, and associated settings
- Weights and measures, temperatures, pressures, timings, and other information about experimental conditions
- Hazards and safety
- Results data and settings related to the data analysis
In addition, the experiment record will usually contain information about the procedure followed and observations made, and may contain information about unexpected events, decisions made and the conclusions drawn from the results. The metadata described above may be captured in a structured format, as provided in the example shown, but much of the ‘active metadata’, particularly the specific values and metadata that change as the experiment is run, may be captured in narrative form that is more difficult to extract from the experiment record.
Experiments typically start with some kind of plan. This may range from basic information such as the materials and safety information (typically required before the researcher can start work), with limited detail of the procedure, to a much more rigorous plan for the procedure to be used. In the example shown previously, the plan contains many details of the experiment to be undertaken: http://cream.annalist.net/annalist/c/Chemistry_Personas/d/type_Plan/plan_Synth_01/. The experiment record itself contains the intended plan, but differences from the plan are also recorded, such as the specific weights, measures, temperatures, and timings used and changes from the original procedure. These changes may represent ‘the active use of metadata’. The plan itself may be repeated many times, with differences in the specific conditions and material batches leading to different results that can be explored and investigated in further experiments. Materials (see http://cream.annalist.net/annalist/c/Chemistry_Personas/l/type_Material/# for examples) have many properties that can be considered metadata that may be actively used. As well as chemical properties and safety information, information could be added about the batch number, a use-by date, the manufacturer, details of who, how, and where a sample was prepared, and other properties such as structure, strain or form.
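One way to surface the differences between a plan and an experiment record, assuming both are available as structured key/value data, is a simple comparison such as the sketch below. The `deviations` function and the field names in the usage note are hypothetical and are not taken from the Annalist records.

```python
def deviations(plan, record):
    """Return fields whose value in the experiment record differs from the plan.

    Both arguments are dicts. The result maps each differing field to a
    (planned, actual) pair; a field missing on one side appears as None.
    """
    fields = sorted(set(plan) | set(record))
    return {
        field: (plan.get(field), record.get(field))
        for field in fields
        if plan.get(field) != record.get(field)
    }
```

For a plan of `{"temperature_c": 80, "time_min": 30}` and a record of `{"temperature_c": 85, "time_min": 30, "batch": "B17"}`, the result lists the changed temperature and the batch that was only known at run time. Those are the candidates for actively used metadata.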