File Formats I Have Used to Deposit Items in the Bath Institutional Repository

What file formats should you use to deposit papers in your institutional repository? Although I recently suggested that RSS could have a role to play in allowing the contents of a repository to be syndicated in other environments, that post didn’t address the question of the preferred file format(s) for mainstream resources such as peer-reviewed papers.

For my papers in the University of Bath Opus repository I initially deposited both the original MS Word file and the PDF version which is normally submitted to the journal or conference. The MS Word file is the original source material, which is needed for preservation purposes, and the PDF file is the open standard version, which should be more resilient to software changes than the MS Word format.

What I hadn’t done, though, was to deposit an HTML version of my papers, despite the fact that I normally create such files. I suspected that uploading HTML files into a repository might be somewhat complicated, so when I uploaded my papers I omitted the HTML versions.

Problems With PDFs

[Image: PDF cover page for a paper in the Opus repository]

However, when I recently viewed the repository copy of the PDF version of my paper on “Library 2.0: Balancing the Risks and Benefits to Maximise the Dividends” I discovered that such papers have a cover page added, as shown.

Having recently been a co-facilitator on a series of workshops on “Maximising the Effectiveness of Your Online Resources” I am well aware of best practices to help ensure that valuable resources can be easily discovered by search engines. And although papers in the repository do have a ‘cool URI’, prefixing the content of all papers in the repository with the same words (“University of Bath Open Online Publications Store” followed by “http://opus.bath.ac.uk/” and “This version is made available in accordance with publisher policies. Please cite only the published version using the citation below.”) goes against best practices for Search Engine Optimisation.

The cover page isn’t the only concern I have with the use of PDFs in institutional repositories. Despite PDF being an ISO standard, not all PDF creation programs will necessarily create PDFs which conform with the standard. Papers containing mathematical formulae or scientific notation are particularly prone to failing to embed the fonts needed to provide a resource suitable for long-term preservation. I also suspect that, although it is possible to create accessible PDFs, many PDF files stored in repositories will fail to conform with PDF accessibility guidelines.

Providing HTML Versions of Papers

In light of these reservations I have decided to provide an HTML version of my recent papers in the University of Bath institutional repository. So my paper on “From Web Accessibility to Web Adaptability” (for which the publisher’s embargo has recently expired) is available in HTML as well as PDF formats.

As I suspected, however, depositing the HTML version of the paper was slightly tricky. I uploaded the paper using the Upload for URL option and this initial attempt resulted in the page’s navigational elements and search interface being embedded in the page. And since the upload mechanism only uploads files which are ‘beneath’ the paper in the underlying directory structure, the page’s style sheet was not included. In short, the page looked a mess.

Since the HTML files I have created hold the contents of the paper separately from the page’s navigational elements, it was not too difficult to create a very simple HTML file, with the citation details appended at the end of the paper, which I included in the resource available in the repository. As can be seen, the contents are available even if the page is not visually appealing.
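The step described above, wrapping the paper’s content in a very simple page with the citation at the end, could be scripted along the following lines. This is only a minimal sketch: the function name `build_standalone_page` and its parameters are hypothetical, and it assumes the body of the paper is already available as an HTML fragment without any navigation or search markup.

```python
from html import escape


def build_standalone_page(title: str, body_fragment: str, citation: str) -> str:
    """Wrap a paper's HTML body fragment in a minimal, self-contained page.

    Hypothetical helper: no external style sheet or script is referenced,
    so nothing 'above' the file in the directory structure is needed.
    """
    lines = [
        "<!DOCTYPE html>",
        '<html lang="en">',
        "<head>",
        '<meta charset="utf-8">',
        # Title and citation are plain text, so they are escaped.
        f"<title>{escape(title)}</title>",
        "</head>",
        "<body>",
        # The fragment is assumed to be trusted HTML and is inserted as-is.
        body_fragment.strip(),
        "<hr>",
        f"<p>Citation: {escape(citation)}</p>",
        "</body>",
        "</html>",
    ]
    return "\n".join(lines) + "\n"
```

The result is a single file with no dependencies, which suits an upload mechanism that only picks up files beneath the paper itself. The price is the plain appearance noted above.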

There are, of course, resource implications in creating HTML versions of papers. However, it will be interesting to see if providing content which is more easily found in Google enhances access to papers which are provided in HTML format. Since resource discovery is one of the main aims of a repository, it might be argued that resources should be provided to ensure that HTML versions of papers are made accessible.

But What About Richer XML Formats?

The purist might argue that whilst HTML is an open and Web-native format it may not be rich enough for use with peer-reviewed papers. I have some sympathy with such views. Anthony Leonard has described how we should go about “Fixing academic literature with HTML5 and the semantic web”. I would agree that there’s a need to explore how HTML5 can be used in the context of institutional repositories.

But mightn’t there be another XML format we should consider? How about an open format which is widely supported and deployed and which, for many authors, will not require any changes to their authoring environment? The format is OOXML, an ECMA standard which has also been standardised as an International Standard (ISO/IEC 29500). However, not all open standards are equally open, and this standard is based on Microsoft’s format for their office applications. As Wikipedia describes it, “the ISO standardization of Office Open XML was controversial and embittered”.
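To illustrate what an XML-based format means in practice: an OOXML word-processing file (.docx) is a ZIP package whose main text lives in an XML part named word/document.xml. The sketch below uses only the Python standard library to pull the paragraph text out of such a file. The function name `docx_paragraphs` is my own, and the code ignores headers, footnotes and other parts of the package, so treat it as an illustration rather than a complete reader.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(path_or_file) -> list[str]:
    """Return the text of each non-empty paragraph in a .docx package."""
    with zipfile.ZipFile(path_or_file) as package:
        xml_bytes = package.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across one or more <w:t> runs.
        text = "".join(node.text or "" for node in para.iter(f"{{{W_NS}}}t"))
        if text:
            paragraphs.append(text)
    return paragraphs
```

The same few lines would not work on the older binary .doc format, which is one practical sense in which the XML-based format is more open to reuse.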

In light of this discussion, what format(s) would you recommend for use with institutional repositories?