On Thursday, 10 May I attended the Eduserv Symposium 2012 on “Big Data, Big Deal?”. The Symposium Web site introduced the topic by focussing on the excitement which currently surrounds the phrase “Big Data”:

“The hot IT buzzword of 2012, big data has become viable as cost-effective approaches have emerged to tame the volume, velocity and variability of massive data” – Edd Dumbill, O’Reilly Radar.

In the opening to the event, Andy Powell, the Eduserv Symposium chair, suggested that claiming 2012 as the year of Big Data probably means that the term will be over-hyped. Rather than revisiting suggestions as to what Big Data could do, Andy explained that the aim of the Symposium would be primarily to provide an opportunity to hear about what practitioners are actually doing.

This approach meant that many of the talks went into the technical details of either software and systems (Hadoop, CouchDB, NoSQL, etc.) or the application area (e.g. genome sequencing). Due to my lack of expertise I will not attempt to summarise the details of the talks. If you do have an interest in the details of the presentations, you will be pleased to hear that recordings of the talks are available via the Eduserv Web site, together with the speakers’ slides.

For those who are new to the area, my colleague Marieke Guy has summarised what is meant by Big Data:

Big data is considered to be data sets that have grown so large and complex that they present challenges to work with using traditional database management tools. The key factors are seen to be the “volume, velocity and variability” of the data.

Several of the talks addressed the relevance of Big Data in areas of scientific research. Although this is clearly of interest to the higher education sector, I felt that it was unfortunate that there were no talks on learning analytics. The popularity of the Learning Analytics and Knowledge 2012 conference, held in Vancouver on 29 April – 2 May 2012, indicates the importance of this area. As a number of people from JISC and JISC services attended that conference, I felt it would have been particularly useful if the symposium had addressed this topic. As I suggested at the event, after hearing about how large retailers are gaining competitive advantages from analysis of purchasing patterns, it may be interesting to analyse purchases of electronics and cans of beans, but analysis of data associated with student learning raises many interesting ethical issues which the sector needs to address.

The opening speaker pointed out that the aggregation and analysis of large volumes of data would support evidence-based policy decisions. This is an approach I support, and over the past few years I have gathered small data in order to inform policy-making processes. For me the role of the data scientists and data journalists who can help to interpret, understand and communicate findings provided by data, big or small, will be important. For scientists the interpretation of Big Data might inform the development of scientific understanding (as is the case with the Big Data being gathered by the Large Hadron Collider), whereas in the public sector it might inform policy, as can be seen from the abstract for the talk on “Making data a way of life for public servants” given by Max Wind-Cowie, Head of the Progressive Conservatism Project at Demos:

The data agenda has made great progress under this Government – particularly in the area of transparency. But public servants too often feel left out of the equation or, worse, see transparency as a threat. Too often the public sector looks at big data as a risk, a problem waiting to happen and a potential tool for undermining its work. If Britain is to truly reap the benefits of big data we need to make data – its collection and its use – a boon to public servants, not a burden.

The Government’s interest in using Big Data to inform policy decisions clearly makes the subjectivity of the interpretation of such analysis an important issue!

My colleague Marieke Guy summarised some of the key themes in her report on the event, which included:

We don’t need to get hung up on the ‘big’ word. Many of the benefits of evidence-based policy decisions can be gained by analysis of data which may not be regarded as Big based on the characteristics of “volume, velocity and variability”.

The tools are now available. Marieke highlighted Hadoop, CouchDB and NoSQL, which all allow people to work easily with data sets. She may also address the issue of tools which can be used for managing Big Data in her session on “Big and Small Web Data”, which will be held at the IWMW 2012 event. I should also mention that a post on “Analytics Reconnoitre: Notes on Open Solutions in Big Data from #esym12” by Martin Hawksey of JISC CETIS highlights a range of tools and provides a useful set of links to further sources of information.

We don’t yet know what data to get rid of. The issue of preservation of Big Data was of particular interest to me in light of my involvement in Web preservation issues. Preservation experts often point out the importance of selection criteria to define resources which should be preserved. However, as we heard at the symposium, such selection criteria are based on an understanding of what should be regarded as important. For the preservation of scientific data the decisions will be based on an understanding of a particular model, but what if the model is found to be incorrect? Donald Rumsfeld famously suggested that:

[T]here are known knowns; there are things we know we know.
We also know there are known unknowns; that is to say we know there are some things we do not know.
But there are also unknown unknowns – there are things we do not know we don’t know.

To paraphrase this:

[T]here are known knowns; there are things we know we know are of value and worth preserving.
We also know there are known unknowns; that is to say we know there are some things we do not know whether they are worth preserving.
But there are also incorrect unknowns – there are things we thought we knew but about which we were mistaken.

Martin Hawksey concludes his report on the event by encouraging readers to:

watch some of the videos from the Data Scientist Summit 2011 (I’m still working my way through but there are some inspirational presentations).

I agree with Martin – there were some excellent talks at the event. I would also like to thank Andy Powell and his Eduserv colleagues for the live-streaming and for making the videos available shortly after the event was over. I was also pleased to discover that the videos have been made available on Eduserv’s YouTube channel, which means that I can now embed the opening keynote, “Big Data and implications for storage” by Rob Anderson at Eduserv Symposium 2012, in this post: