I’ve never been much of a fan of the PDF format. Back in the early days of the Web I had hoped that the proprietary PDF format would be replaced by HTML and CSS. Back then there was an expectation that CSS would be developed to provide the fine control over page layout that is available using word processing and DTP applications. The development of the Document Object Model (DOM) for HTML/XML also promised to deliver an environment in which such resources could be interrogated and manipulated in ways which would not be possible with more monolithic resources such as PDFs. And finally HTML and CSS provided accessibility benefits not available in PDF.

However, over the years it became apparent that HTML/CSS wouldn’t provide such fine layout control. And we found that HTML as used in the real world tended to be a structural mess, sometimes referred to as ‘tag soup’.

We also discovered that in many cases users preferred PDFs, especially for resources which were designed as printed documents.

And last year PDF became an ISO standard, following on from the standardisation of PDF/A as an archival format.

So PDF is now an open standard, is suitable for archival purposes and has widespread support. Accessible PDFs can now be created, and there is also an Adobe SDK which supports the development of applications to create and process PDF files.

Sounds good, doesn’t it? But in practice, do PDF files actually conform to the PDF standard? And although PDF files can be accessible, in practice do the PDF files which are produced in normal workflow processes actually comply with accessible PDF guidelines?

I recently searched for PDF validation tools. I found that a number of tools were available, many of which were expensive to purchase. I made use of one free email-based tool (Validatepdfa) and used it to report on the conformance of a couple of PDF files for recent peer-reviewed papers which I had submitted to journal / conference organisers. Although these files may have conformed with the publisher’s layout and house style requirements, the tool reported quite a number of errors. As you can see, the error messages aren’t particularly helpful and it is difficult to see how such errors can be remedied:

Issues addressed (1) File structure Incorrect delimiter used for indirect object 340 0
Issues addressed (2) File structure Incorrect delimiter used for indirect object 370 0
Issues addressed (3) File structure Missing ID in trailer dictionary

Issues addressed (118) Fonts Font ‘TrebuchetMS-Bold’ was successfully substituted and embedded
Issues addressed (119) Fonts CID font subset without CIDSet
Issues addressed (120) Fonts CIDToGIDMap has been successfully embedded in Type2 font LHCKAJ+SymbolMT.
Issues addressed (121) Fonts CID font subset without CIDSet
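
To give a feel for what a message such as "Missing ID in trailer dictionary" refers to, here is a minimal Python sketch which looks for an /ID entry in the last classic trailer of a PDF file. It is a crude illustration rather than a validator: the function name trailer_has_id is made up for this example, and the check assumes a traditional trailer keyword, so it cannot say anything about files which use cross-reference streams instead.

```python
import re
from typing import Optional


def trailer_has_id(pdf_bytes: bytes) -> Optional[bool]:
    """Crude check for an /ID entry in the last classic trailer dictionary.

    Returns True or False when a 'trailer' keyword is found, and None when
    it is not (for example, when the file uses a cross-reference stream),
    because this sketch cannot tell in that case.
    """
    start = pdf_bytes.rfind(b"trailer")
    if start == -1:
        return None
    # The trailer dictionary sits between 'trailer' and 'startxref'.
    end = pdf_bytes.find(b"startxref", start)
    if end == -1:
        end = len(pdf_bytes)
    trailer = pdf_bytes[start:end]
    # The /ID entry is an array of two strings, so look for '/ID ['.
    return re.search(rb"/ID\s*\[", trailer) is not None
```

A file could be checked by reading it in binary mode and passing the bytes to the function. A result of False would correspond to the kind of problem reported in issue (3) above, although the validation tool will be applying far more thorough rules than this.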

I then used the Adobe Acrobat software to report on any accessibility problems with the papers. I used this tool to analyse all of the peer-reviewed papers which I have written in the past 10 years, and found that none of the papers actually conformed with Adobe’s accessibility guidelines.

The error messages provided in Adobe Acrobat were mostly helpful and it seemed that one consistent problem was the lack of a specified language for the contents of the document. Fortunately Adobe Acrobat does allow some of the accessibility problems to be fixed with the software, so I assigned the language English to all of the documents. Some of my papers now do conform with PDF accessibility guidelines (at least as far as automated checking tools can detect), but the documents which had been uploaded to the University of Bath’s institutional repository a few months ago will be the non-accessible versions. There are issues about the workflow processes for uploading papers to institutional repositories: who should have a responsibility for ensuring compliance with guidelines; at what stage should appropriate metadata be added; who should ensure that the metadata is correct; what tools can be used to create and maintain such metadata; what level of detail should be provided; how do we ensure that the metadata isn’t corrupted during workflow processes; etc. Did you really think that using PDF was easy?
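
The missing language described above is the sort of thing which could, in principle, be spotted before a paper is uploaded to a repository. The following Python sketch scans the raw bytes of a PDF for /Lang entries written as literal strings, such as /Lang (en-GB). It rests on several assumptions: the function name declared_languages is invented for this example, the scan will miss entries stored inside compressed object streams or written as hexadecimal strings, and it does not distinguish a document-level language from one set on an individual element. An empty result therefore means "none found by this crude scan", not "definitely missing".

```python
import re
from typing import List

# Matches a /Lang key followed by a literal string, e.g. /Lang (en-GB).
LANG_ENTRY = re.compile(rb"/Lang\s*\(([^)]*)\)")


def declared_languages(pdf_bytes: bytes) -> List[str]:
    """Return the values of any literal-string /Lang entries in the raw bytes."""
    return [value.decode("latin-1") for value in LANG_ENTRY.findall(pdf_bytes)]
```

Something along these lines could be run over a folder of papers as part of a deposit workflow, with any file which returns an empty list being flagged for a proper check in Adobe Acrobat or a similar tool.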

I suspect that most people aren’t particularly interested in conformance of such resources with PDF standards and accessibility guidelines, although it was reassuring to see the post on “Survey on malformed PDFs?” on the DCC blog.

But if we are serious about the importance of standards, particularly in the context of digital preservation, and if we are serious about the accessibility of digital resources, we will need to ensure that our workflow practices result in resources on our Web sites and institutional repositories which are conformant.

Or perhaps strict conformance with standards and accessibility guidelines is over-rated. Should we simply acknowledge that the ease of creation of PDF resources is key to the creation of such resources and adding additional steps into the workflow processes will add unnecessary complexities and barriers?