PDF Metadata – Why Is it So Poor?

PDF metadata – why so poor? asked Ross Mounce in a blog post published on New Year’s Eve.

In the post Ross expressed surprise that although “with published MP3 files of audio you get rather good metadata … the results from a little preliminary survey of academic publisher PDF metadata” were poor: “Out of the 70 PDFs I’ve published (meta)data on over at Figshare, only 8 of them had Keywords metadata embedded in them”.

This made me wonder about the quality of the metadata for papers I have uploaded to Opus, the University of Bath repository.

I looked at a paper on A Challenge to Web Accessibility Metrics and Guidelines: Putting People and Processes First which is available in Opus in PDF and MS Word formats.

I first used Adobe Acrobat in order to display the metadata for the original source PDF file, prior to uploading to the repository. As can be seen from the accompanying screen shot the metadata included the title, the author details (with the email address for one of the authors) and two keywords.

However, looking at the display for the PDF downloaded from the repository, we find that no metadata is available!

This PDF differs from the original source in that a cover page is added dynamically by the repository in order to provide appropriate institutional branding. It would appear that in the creation of the new PDF resource, the original metadata is lost.
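A check of this kind could be scripted rather than done by eye in Acrobat. The following is a minimal sketch, assuming the metadata lives in the PDF document information dictionary as plain literal strings such as `/Title (...)`. The function names `info_fields` and `lost_fields` and the sample byte strings are hypothetical, and the regular-expression scan is deliberately crude: it will miss hex or UTF-16 encoded strings, compressed object streams and XMP packets, and it may pick up a bookmark `/Title` entry by mistake. A real survey would use a proper PDF library.

```python
import re

# Fields of the PDF document information dictionary to look for.
FIELDS = ("Title", "Author", "Keywords")


def info_fields(pdf_bytes: bytes) -> dict:
    """Return the Info fields found as plain literal strings in raw PDF bytes.

    Crude by design: it misses hex or UTF-16 strings, strings containing
    parentheses, compressed object streams and XMP packets.
    """
    found = {}
    for name in FIELDS:
        pattern = rb"/" + name.encode("ascii") + rb"\s*\(([^()]*)\)"
        match = re.search(pattern, pdf_bytes)
        if match:
            value = match.group(1).decode("latin-1").strip()
            if value:
                found[name] = value
    return found


def lost_fields(original: bytes, derived: bytes) -> list:
    """List fields present in the original PDF but absent from the derived one."""
    before = info_fields(original)
    after = info_fields(derived)
    return [name for name in FIELDS if name in before and name not in after]


# Hypothetical stand-ins for the source PDF and the repository copy.
source = b"<< /Title (Example paper) /Author (Example Author) /Keywords (example keyword) >>"
with_cover = b"<< /Producer (hypothetical cover page tool) >>"
print(lost_fields(source, with_cover))
```

Run against the source PDF and the copy downloaded from the repository, a non-empty list would flag a workflow step that drops embedded metadata.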

Looking at the metadata created in the original source document – an MS Word file – we can see the authors’ names, which were subsequently concatenated into a single field. We can also see that although the title of the paper was given correctly, poor keywords had been included, which did not reflect the keywords which were included in the paper itself (Web accessibility, disabled people, policy, user experience, social inclusion, guidelines, development lifecycle, procurement).
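The embedded properties of a Word master can also be read without opening Word. This is a minimal sketch, assuming the file is in the newer .docx format (a zip package) and that the core properties sit at the conventional `docProps/core.xml` path. It would not work for an older binary .doc file. The function names `docx_core_properties` and `build_demo_docx` are hypothetical, and the demo package contains placeholder values, not the real metadata of the paper.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by docProps/core.xml in the Office Open XML package format.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def docx_core_properties(source) -> dict:
    """Read title, creator and keywords from a .docx path or file object."""
    with zipfile.ZipFile(source) as package:
        root = ET.fromstring(package.read("docProps/core.xml"))
    wanted = {"title": "dc:title", "creator": "dc:creator", "keywords": "cp:keywords"}
    props = {}
    for key, tag in wanted.items():
        element = root.find(tag, NS)
        # A missing or empty element is reported as an empty string.
        props[key] = (element.text or "").strip() if element is not None else ""
    return props


def build_demo_docx() -> io.BytesIO:
    """Build an in-memory package holding only core.xml (not a valid Word file)."""
    core = (
        f'<cp:coreProperties xmlns:cp="{NS["cp"]}" xmlns:dc="{NS["dc"]}">'
        "<dc:title>Example paper</dc:title>"
        "<dc:creator>Author One; Author Two</dc:creator>"
        "<cp:keywords>placeholder keyword</cp:keywords>"
        "</cp:coreProperties>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("docProps/core.xml", core)
    buffer.seek(0)
    return buffer


print(docx_core_properties(build_demo_docx()))
```

Note that the creator comes back as a single string, which matches the observation above that several authors end up concatenated into one field.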

I suspect that I am not alone in not spending much time in ensuring that appropriate metadata is embedded in the master source of a peer-reviewed paper. I have also previously not considered how such metadata might be lost in the workflow processes when uploading to an institutional repository: after all, surely the important metadata is added when the paper is deposited into the repository?

Ross’s blog post made me check the embedded metadata – and I discovered that the correct metadata is still included in the MS Word file which was uploaded to the repository along with the PDF copy.

Does the loss of the metadata embedded in the PDF matter? After all, surely people will use the search facilities provided in the repository in order to find papers of interest?

But people will not necessarily visit a repository to find papers of interest. A post which described A Survey of Use of Researcher Profiling Services Across the 24 Russell Group Universities showed that on 1 August 2012 there were over 18,000 users of ResearchGate in the 24 Russell Group universities. Judging by the messages I am currently receiving, along the lines of “28 of your colleagues from University of Bath have joined ResearchGate in the last month. Why not follow them today?”, use of this service is growing.

As can be seen from the screenshot of my ResearchGate profile, the service provides access to PDF copies of my papers. I normally simply provide a link to the PDF hosted in the repository, but the example illustrated contains a copy of the original PDF which was uploaded to the service by one of the co-authors.

In the case of most of my papers it is clear from the thumbnail of the PDF that the paper contains the coversheet provided by the repository.

ResearchGate paper (hosted in Opus)

We can see that the PDF copy of a paper hosted in a repository should not be regarded as a final destination; rather the PDF may be surfaced in other environments.

It will therefore be important to ensure that workflow processes do not degrade the quality of the PDF. It will also be important to ensure that authors are made aware of how embedded metadata may be used by services beyond the institutional repository. But to what extent do repository managers feel they have a responsibility to advise on practices which will enhance the discoverability of content on services hosted outside the institution?

In a paper which asked “Can LinkedIn and Academia.edu Enhance Access to Open Repositories?” Jenny Delasalle and I commented on how “commercial publishers are encouraging authors to use social media to drive traffic to papers hosted on publishers’ web sites” and provided examples of such approaches from Taylor and Francis, Springer, Sage and Oxford Journals. As an example, Taylor and Francis describe how they are “committed to promoting and increasing the visibility of your article and would like to work with you to promote your paper to potential readers” and go on to document services which can help achieve this goal.

In a blog post which discussed the ideas described in the paper, I noted that we had failed to find significant evidence of similar approaches being employed by repository managers:

It was interesting that in Jenny’s research she found that a number of commercial publishers encourage their authors to use services such as LinkedIn and Academia.edu to link to their papers hosted behind the publishers paywalls – and yet we are not seeing institutional views of the benefits of coordinated use of such services by their researchers. Institutional repository managers, research support staff and librarians could be prompting their institutions to make the most of these externally provided services, to enhance the visibility of their researchers’ work in institutional repositories.

But that paper was limited to the use of third-party services to provide access routes to research papers. What of the bigger picture in which institutional workflow processes should be designed to enhance discoverability?

The ‘inside-out and outside-in library’

On Wednesday in a post entitled Discovery vs discoverability … Lorcan Dempsey explored the idea of the “inside-out and outside-in library”. In the post Lorcan described how:

Throughout much of their existence, libraries have managed an outside-in range of resources: they have acquired books, journals, databases, and other materials from external sources and provided discovery systems for their local constituency over what they own or license.

However in a digital and network world, there have been two major changes, which shift the focus towards inside-out:

First access and discovery have now scaled to the level of the network: they are web scale. If I want to know if a particular book exists I may look in Google Book Search or in Amazon, or in a social reading site, in a library aggregation like Worldcat, and so on. … Secondly the institution is also a producer of a range of information resources: digitized images or special collections, learning and research materials, research data, administrative records (website, prospectuses, etc.), faculty expertise and profile data, and so on.

Lorcan goes on to describe the challenge facing libraries:

How effectively to disclose this material is of growing interest across libraries or across the institutions of which the library is a part. This presents an inside-out challenge, as here the library wants the material to be discovered by their own constituency but usually also by a general web population.

I would suggest that institutional repositories could usefully adopt the approach taken by Taylor and Francis:

“[The institution is] committed to promoting and increasing the visibility of your article and would like to work with you to promote your paper to potential readers”

But rather than simply encouraging researchers to add links to papers deposited in the repository from popular services such as LinkedIn and ResearchGate, might the institutional goal be enhanced by encouraging researchers to make the content of their papers available in such third-party services (subject to copyright considerations)? The institutional repository would then provide both a destination and a component in a workflow, with papers being surfaced in services such as ResearchGate, as I have illustrated above.

If such an approach were to be embraced there would be a need to ensure that embedded metadata was not corrupted through repository workflow processes. If, however, the repository is regarded as the sole access point, there would be little motivation to address such limitations in the workflow.

Or to put it another way, repository managers will have a need to manage content hosted within the institution, including management to support the use of the content by services they have no control over.

To a certain extent, this has already been accepted: repositories were designed to have “cool URIs” which can help resources to be discovered by Google. I am suggesting that there is a need to observe usage patterns which indicate emerging ways in which users are finding content. The growing numbers of email alerts from ResearchGate suggest that it may be a service to monitor – with Ross Mounce’s recent post on the quality of metadata embedded in PDFs suggesting one area in which there will be a need to revisit existing workflow processes.

PS. Ross Mounce described “a little preliminary survey of academic publisher PDF metadata” and has published the data on Figshare. Has anyone harvested the metadata embedded in PDFs hosted on repositories and published the findings?
