Repository file type analysis for educational repositories

One of the characteristics that we have recognised in working with educational repositories is the diversity of the file types content creators work with in their development of educational resources.

“It would be unrealistic, even positively unhelpful for an educational repository to limit by file types the kind of resources they are prepared to accept.”

While the dominant institutional research repositories may content themselves with accommodating PDFs, Word documents and, increasingly, LaTeX files, and other subject-specialist repositories may accept only .jpg or .cif files, it would be unrealistic, even positively unhelpful, for an educational repository to limit by file type the kinds of resources it is prepared to accept. The reality that we have accepted in EdShare is that we need to be responsive to and supportive of our whole user community, and that accepting this situation may well bring significant preservation challenges.

Following a recent upgrade of the repository software to EPrints v3.2, a version that includes preservation and file format identification tools, we were able to produce an accurate profile of the file formats stored in EdShare:

Figure: EdShare format profile by file count (14 Oct. 2010). The chart shows the top 16 formats, each with more than 150 files; 49 'other' formats are aggregated for clarity of chart presentation.

Although not shown here, a breakdown of the repository contents by file size would look different. The largest format by file count among the ‘others’ is MPEG 1/2 Audio Layer 3 (MPEG-4 Media File and MPEG-1 Video Format are also in this aggregate). We anticipate that these formats could each account for 10-20% of file storage space, a far larger share than their file counts suggest.

The ‘unknown’ files have a format that is either unknown, or known but not recognised, to the format identification tool. These are critical files in a preservation risk analysis, until they can be identified. The issue of unknown files will be considered further in a forthcoming blog post comparing format profiles among all the KeepIt exemplar repositories.
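The aggregation behind a chart like the one above can be sketched in a few lines of Python. This is a minimal illustration, not how the EPrints tools work internally: the function name `aggregate_profile`, the sample counts and the decision to always report ‘unknown’ separately are our assumptions, and only the threshold of 150 files comes from the chart caption.

```python
from collections import Counter

def aggregate_profile(format_counts, min_count=150):
    """Keep formats with more than min_count files; fold the rest into 'other'.

    'unknown' is always reported on its own (an assumption made here),
    because unidentified files matter for preservation risk however few.
    """
    profile = Counter()
    for fmt, count in format_counts.items():
        if fmt == "unknown" or count > min_count:
            profile[fmt] += count
        else:
            profile["other"] += count
    # Largest formats first, as they would appear in the chart legend.
    return profile.most_common()

# Hypothetical counts, for illustration only (not EdShare's real figures).
sample = {"JPEG": 900, "PDF": 400, "MP3": 120, "CIF": 3, "unknown": 40}
for fmt, count in aggregate_profile(sample):
    print(f"{fmt}: {count}")
```

Running the same function over total bytes per format, rather than file counts, would give the by-size view discussed above.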

EdShare format profile: what does it mean?

From this data, we are left with lots of questions:

Is there likely to be any change over time in these proportions and ratios? What is the relationship between the deposit level for specific file types and the usage levels of those file types: do the two map simply one to the other, or is the highest usage associated with a relatively small number of file types? We will generate time slice data for EdShare content over the next period so that we can develop trend analysis and explore more fully what the statistics can tell us. The upgrade to version 3.2 of the EPrints repository software will make this work much more straightforward.
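As a rough idea of what such a time slice comparison might involve, the sketch below converts two snapshots of file counts into percentage shares and reports the change for each format. The function names and the sample figures are hypothetical, and each slice is assumed to contain at least one file.

```python
def shares(counts):
    """Convert raw file counts per format into percentage shares."""
    total = sum(counts.values())
    return {fmt: 100.0 * n / total for fmt, n in counts.items()}

def share_changes(earlier, later):
    """Percentage-point change in each format's share between two slices."""
    before, after = shares(earlier), shares(later)
    formats = sorted(set(before) | set(after))
    return {
        fmt: round(after.get(fmt, 0.0) - before.get(fmt, 0.0), 1)
        for fmt in formats
    }

# Hypothetical time slices, for illustration only.
slice_earlier = {"JPEG": 500, "PDF": 300, "PPT": 200}
slice_later = {"JPEG": 600, "PDF": 300, "PPT": 200, "MP4": 100}
print(share_changes(slice_earlier, slice_later))
```

Note that a format can lose share even when its file count is unchanged, as PDF and PPT do here, simply because the repository has grown around it.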

We will also explore what the institution and individual users expect of EdShare, and what our service can and should offer. Do all users (and the University) expect and require unbroken access at all times to all resources? If so, this would be a level of service provision unprecedented across the portfolio of systems that support and deliver education across the institution. We will examine the level of service delivery that the University's central IT service procures for comparable systems, and chart our own provision against it. This may involve unbroken access for certain groups of staff (teachers and educational administrators) or unbroken access for resources with certain characteristics. In the new areas that EdShare is charting, this may be difficult to clarify and articulate in the first instance.

Having identified the different format types in EdShare, both by number of files and by file size, we need to understand what exactly these resources are. We will explore their storage costs over the short, medium and longer term, as well as the bandwidth costs of delivering these resources to the end user. This data, together with information about the risk levels associated with the file types themselves, will support a more detailed understanding of the materials we are working with and of the expectations and requirements of our user community.

Comparing other sources of educational materials

As a complementary piece of work, we queried the database of the institutional VLE (Blackboard) for statistics on the range of file types it holds, accumulated over 10 years of use. The data from the VLE shows the following distribution:

DOCS – 20%
GIF – 20%
PDF – 11%
JPEG – 8%
PPT – 8%
HTM – 5.3%
XML – 4.4%
DAT – 4%
SWF – 2.7%
MP3 – 1%
Others – 15.6%

This data has an extremely long tail, made up of a very large number of file types.
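One way to put a number on that concentration is to count how many of the largest named formats are needed to cover a given share of all files. The sketch below uses the percentages listed above; the function name `formats_needed` is our own, and the aggregated "Others" entry is left out because it is not a single format.

```python
# Percentages as listed above for the Blackboard VLE. "Others" is the
# aggregated long tail, so it is excluded from the named formats.
vle_shares = [
    ("DOCS", 20.0), ("GIF", 20.0), ("PDF", 11.0), ("JPEG", 8.0),
    ("PPT", 8.0), ("HTM", 5.3), ("XML", 4.4), ("DAT", 4.0),
    ("SWF", 2.7), ("MP3", 1.0),
]

def formats_needed(shares, target):
    """Count how many of the largest formats are needed to reach target percent."""
    covered = 0.0
    ranked = sorted(shares, key=lambda item: -item[1])
    for rank, (_, share) in enumerate(ranked, start=1):
        covered += share
        if covered >= target:
            return rank
    return None  # target cannot be reached from the named formats alone

print(formats_needed(vle_shares, 50))  # 3: DOCS, GIF and PDF together reach 51%
print(formats_needed(vle_shares, 75))  # 7: adding JPEG, PPT, HTM and XML reaches 76.7%
```

On these figures, three formats account for just over half of the files, while the ten named formats together cover 84.4%, which leaves the remainder to the long tail.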

As I stated in a blog post here one year ago: “EdShare provides a safe, secure and persistent URL for the learning and teaching resources of the whole University community deposited in its shares. This EdShare approach requires that we engage actively with the issues of digital preservation which are particularly relevant to the growing community for learning and teaching repositories.” To deliver on these words, we may need to clarify exactly what we are committing the repository to and define our “offer” more precisely.

“Most repositories were simply not able to provide the format information that we requested – they lacked the functionality required”

In preparation for the next stage of this work, we decided to consult more widely across the educational repository community in order to establish how representative (or otherwise) EdShare is in terms of the diversity of file types we currently host. I contacted known repository managers from a cross-section of different types of repositories: institutional repositories, subject repositories and national services. We received a range of responses. Most repositories were simply not able to provide the information that we requested: they lacked the functionality required, or had not created a report which they could easily run to discover the range of file types they hosted. Where we did manage to obtain some data, though, it looks like this:

Repository A – national, Open Educational Resources Collection
JPEG – 39%
Links – 39%
HTML – 2%
PPT – 2%
Unknown – 1.5%
Others – 16.5%

Repository B – small, teaching-focused institution, collection of educational resources
JPEG – 67%
PDF – 14%
FLV – 10%
MP3 – 7%
Others – 2%

Interestingly, although they have quite different rationales, both of these repositories share the same predominant file type as EdShare. They also share a characteristic observed in other repositories: in any community, for any kind of repository, there will typically be a single predominant file type that people favour, accompanied by a long tail of mixed and varied file types which pose all kinds of challenges.

“there is some consistency and indications of general trends”

We acknowledge that we have managed to gather only a very limited amount of data here. However, the data we do have shows remarkable consistency and indications of general trends. Certainly, we have some basis on which to focus on the preservation issues relating to a relatively small number of file types, and some indications of where it would be useful to investigate further, in terms of the highest-use file types rather than necessarily the largest number of files or the highest level of storage space. Over the coming months we intend to work on this area further and understand how representative EdShare is of the small but growing community of educational repositories.
