On Friday 7th October 2011 I attended a one-day event on “The future of the past of the web”. The event, which was organised by the British Library, the Digital Preservation Coalition (DPC) and the JISC, was the third joint Web archiving workshop, the previous two workshops having been held in 2006 and 2009.
I have had an interest in Web archiving for some time, having given a talk way back in 2002 on “Archiving The UK Domain and UK Web Sites: What Are The Issues?” at a DPC seminar on “Web-archiving: managing and archiving online documents and records”. It seems that the Web archiving world has changed significantly since I gave my talk and, indeed, since the first two workshops. As a number of people commented, many of those involved in Web archiving initiatives are no longer primarily focussed on archiving conventional Web ‘pages’ – rather the sector is facing the challenges of archiving a much more dynamic environment, with the Social Web now providing significant content which social historians of the future will wish to analyse in order to make sense of today’s online (and offline) environment.
The changes in emphasis can also be seen in the development of end user services which can help to make the importance of Web archiving more obvious to the wider community. In the opening plenary talk Herbert Van de Sompel described Memento, an initiative which is looking to “add time to the Web” through developments which build on existing web protocols, including HTTP and content negotiation.
A Memento plugin for Firefox is available which enables end users to gain an understanding of the benefits which such developments can provide. I was also pleased to hear that a Memento Browser is available for Android mobile devices. For those who may not be able to install such applications, Memento’s capabilities can also be seen by using the Internet Archive’s Wayback Machine. As can be seen from the accompanying image, you can view the BBC News Web site for October 2008, and perhaps reminisce about the early days of the financial crisis.
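For readers wondering what “adding time to the Web” through content negotiation might look like in practice, here is a minimal Python sketch. It rests on one thing I am confident of, that Memento negotiates on time using an `Accept-Datetime` request header, and on several assumptions of my own: the function name `accept_datetime_headers` is made up for illustration, the TimeGate address is a hypothetical placeholder rather than a real service, and the request is built but never sent.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request


def accept_datetime_headers(when: datetime) -> dict:
    """Return request headers asking for a resource as it was at `when`.

    `when` must be timezone-aware; it is converted to GMT, as HTTP dates require.
    """
    if when.tzinfo is None:
        raise ValueError("when must be a timezone-aware datetime")
    stamp = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return {"Accept-Datetime": stamp}


# Hypothetical TimeGate address: a real archive publishes its own.
TIMEGATE = "http://timegate.example.org/http://news.bbc.co.uk/"

headers = accept_datetime_headers(datetime(2008, 10, 1, tzinfo=timezone.utc))
request = Request(TIMEGATE, headers=headers)  # built but deliberately not sent
print(headers["Accept-Datetime"])
```

The point of the sketch is that nothing exotic is needed: the client asks for an ordinary URL and states the date it is interested in, in the same way that a browser already states its preferred language, and it is then up to the server to point to the archived copy closest to that date.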
Further examples of rich interactive interfaces to Web archives have been developed to enhance the UK Web Archive service. As described by Maureen Pennock and Lewis Crawford, these include N-Gram visualisations of searches across the archive, tag clouds generated from the General Election 2005 Collection and a 3D wall visualisation across archived collections.
Services provided by the British Library have, of course, always been valued by researchers. But in a talk on “Web Archiving: the State of the Art and the Future” Eric Meyer, Research Fellow, Director at the Oxford Internet Institute, asked us to consider how effective we have been in making social science researchers aware of the potential of Web archives in supporting their research. There is, I feel, a need for further advocacy to ensure that researchers are aware of the ways in which not only archived digital resources, but also data associated with such archives, can support research interests.
The increasing importance of Web archiving has led to archiving tools and services being developed within the commercial sector in addition to activities led by national libraries and archives, higher education and EU-funded consortia. Mark Williamson was invited to give a presentation at the last minute and described various archiving activities of his company, Hanzo. It was interesting to hear how a well-known multinational company such as Coca-Cola, which, as might be expected, has well-established processes for archiving physical objects, was slow to recognise the importance of digital archiving, initially of its public Web site and then of its public presence on social web sites, including the Coca-Cola Facebook page. Mark also described how APIs are being developed for the Hanzo Web archiving system and how the APIs would be valuable in analysing the data associated with large collections of Web archives. As Mark put it: “The individual pages in a web archive are pretty boring – it’s the Big Data that’s exciting”. It will be interesting to see whether the Hanzo software could provide a solution for Universities which may be interested in archiving their digital presence, especially their use of social web services, where the content cannot be managed through the content management system used to manage the institutional Web presence.
As well as finding the talks at the workshop of interest, it was also interesting to observe the gaps. In the final session Neil Grindley, JISC Programme Manager for digital preservation, asked the panel for their thoughts on standards for web archiving – and found that no one on the panel. However in response to my tweet that:
Helen Hockx commented that:
@briankelly I agree. Both ISO and BSI have initiated and are going to initiate work on standards related to web archiving.
If the next Web archiving event is held in another two years’ time, it will be very interesting to see what the focus of development work will be. Ten years ago the drive for Web archiving came from national and international bodies. However, as suggested in a tweet posted a few hours ago by Les Carr, who provided a link to a blog post on using EPrints repositories to collect data from Twitter, perhaps we shall see institutions appreciating the value of digital content created by members of the institution, including content hosted outside of the institution. Or perhaps, as suggested by the EU-funded Arcomem project, it may be large EU-funded projects which help to preserve today’s cultural memories which are held on online services, including social web services. And although motivated individuals may wish to make use of tools such as Memolane, a “Social Web application that captures all of your memories from different Social Networks like Flickr, Facebook, Twitter, Youtube” highlighted on the Arcomem Website as a “Personal Timemachine for the Social Web”, in reality I don’t think we can leave it to individuals to take responsibility for preserving their own public content. Of course, this raises the question of ‘walled gardens’, which apparently mean that content cannot be accessed by third parties, and of issues such as privacy and copyright. I wonder if the next Web archiving workshop will get bogged down by the difficulties which such issues raise, or if ways of circumventing such difficulties will have been found?