Back in September 2001 I gave a talk at the JANET User Support Workshop, which was held at Loughborough University. I remember a Pro Vice Chancellor giving the welcome talk, during which he mentioned that “Loughborough has the highest proportion of students of any place in the UK” (or words to that effect). I remember him saying that because I worked at Loughborough University from 1984-90 and was interested in seeing how the increase in student numbers was changing the town centre – there were a number of superpubs which weren’t there when I lived in the town.

Last November I spent a few days at Aberystwyth University. While I was there, on my way to a CAMRA pub, I noticed large numbers of students (dressed as doctors and nurses) on a pub crawl around the town. This made me wonder if a small place like Aberystwyth might have overtaken Loughborough as the town or city in the UK with the largest proportion of students.

That was the background to my recent “Challenge To Linked Data Developers” in which I asked “Which town or city in the UK has the largest proportion of students?”. In order to simplify the challenge and spare SPARQL developers from having to track down relevant official data sources, I asked that the challenge be addressed using data held in DBpedia, the RDF datastore of structured information extracted from Wikipedia. An additional aim was to gain an understanding of the quality of the data (and the data structures) held in DBpedia, which is frequently mentioned as having a central role to play in the Linked Data world.

A week after issuing my challenge I published the “Response To My Linked Data Challenge”. However, the answers obtained from querying DBpedia were clearly incorrect – Cambridge, for example, doesn’t have a population of 12!
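The kind of plausibility check that would flag a result like this can be sketched in a few lines. This is only an illustration, not the query that was actually run against DBpedia: the `Place` record, the function names and the rule that a place cannot have more students than residents are all assumptions of mine, and the place names and figures are invented placeholders (only the population of 12 echoes the error described above).

```python
from dataclasses import dataclass


@dataclass
class Place:
    """Hypothetical record: one town or city with its headline figures."""
    name: str
    population: int
    students: int


def student_proportion(place):
    """Return students / population, or None if the figures look implausible."""
    # A non-positive population cannot be used as a denominator.
    if place.population <= 0:
        return None
    # Assumed rule: more students than residents signals bad data
    # (for example, a population of 12 for a university city).
    if place.students > place.population:
        return None
    return place.students / place.population


def rank_places(places):
    """Rank places by student proportion, skipping implausible records."""
    scored = [(p.name, student_proportion(p)) for p in places]
    usable = [(name, ratio) for name, ratio in scored if ratio is not None]
    return sorted(usable, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Invented placeholder figures, for illustration only.
    sample = [
        Place("Town A", 100, 10),
        Place("Bad record", 12, 500),
        Place("Town B", 200, 50),
    ]
    for name, ratio in rank_places(sample):
        print(f"{name}: {ratio:.1%}")
```

A filter like this only catches the most blatant errors. A population that is wrong but still larger than the student count would pass unnoticed, which is part of why the quality of the underlying data matters.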

On the DCC blog Chris Rusbridge has revisited my challenge in a post entitled “Linked Data and Reality”. Chris suggested that “If we care about our queries, we should care about our sources; we should use curated resources that we can trust. Resources from, say… the UK government?”. That may be true, but I wasn’t primarily after the correct answer when I formulated my challenge – I was more interested in whether DBpedia could provide a reasonable answer, how long it might take to write a SPARQL query and how complex such a query might be. This motivation was acknowledged by bitwacker in his comment that “I think Brian’s challenge should be seen as only a benchmark, a sampling of the effectiveness of linked data practices today.” That’s right – and I’m pleased to have noticed that the DBpedia community has recently issued an “Invitation to contribute to DBpedia by improving the infobox mappings”. In addition, Kingsley Idehen alerted me to the Yago, Opencyc, Umbel and Sumo ontologies, all of which have bindings to DBpedia. (I should also add that Kingsley has written a blog post on “DBpedia receives shot #1 of CLASSiness vaccine” which illustrates how new ontologies can be integrated with DBpedia.)

Perhaps DBpedia could have a role to play in answering the type of query I posed. After all, if you want to compare the proportions of students in towns and cities across several countries, mightn’t DBpedia be an easier place to seek an initial answer than finding and querying statistics from each of the individual countries (especially as the UK Government seems to be taking a leading role in expressing a commitment to Linked Data)?

In addition to suggesting that the query should use official Government sources of data (which Chris Wallace has used to provide an answer to my query), Chris Rusbridge also raised the issue of the need to seek clarity in the queries we pose. Using the Guardian Platform, Chris Wallace found that the place with the highest proportion of students is Milton Keynes. Chris Rusbridge had suggested this in an initial exchange in a LinkedIn Linked Data discussion. And yes, the home of the Open University is likely to have a large number of registered students. But I don’t think the place will be full of students at the start of the academic year, since the Open University is a distance learning institution. The (implied) context of my query was the place for which a significant proportion of students would be likely to affect the local environment, with large numbers of students in town during freshers’ pub crawls and, perhaps, little happening during vacations. So we should rule out the Open University. But what about other universities with a large number of students on distance learning courses? According to a tweet from lordllama, “About 41% of 23,000 students at Leicester University are on distance learning courses”.
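To make the effect of this clarification concrete, here is a rough sketch of how a headline student count might be adjusted before it is compared with a town’s population. The function name and the idea of a single “distance learning share” per institution are my own simplifying assumptions; the only figures used are the ones quoted in the tweet above.

```python
def resident_students(total_students, distance_share):
    """Estimate how many students are actually present in the town.

    distance_share is the assumed fraction of students on distance
    learning courses, between 0.0 and 1.0.
    """
    if not 0.0 <= distance_share <= 1.0:
        raise ValueError("distance_share must be between 0.0 and 1.0")
    return round(total_students * (1.0 - distance_share))


if __name__ == "__main__":
    # Figures quoted in the tweet: about 41% of 23,000 students.
    print(resident_students(23000, 0.41))  # prints 13570
    # A purely distance learning institution contributes nobody to the town.
    print(resident_students(23000, 1.0))   # prints 0
```

On those figures the number of students likely to be seen around Leicester drops from 23,000 to roughly 13,600, which would change the town’s position in any ranking by proportion.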

There is also the question of how we should treat institutions such as the University of Brighton in Hastings which “offers University of Brighton degrees”. As Margaret Wallis pointed out in response to my initial blog post, this institution has “grown in six years from 40 students to 600+”. But should those students be included in the totals for the University of Brighton or for Hastings? The general question is how we should treat institutions which have multiple campuses split across different towns or, as may be the case in this example, institutions which award degrees on behalf of other institutions.

You may also notice that my question about places with a large proportion of students is now talking about universities and university students. But what about students at FE colleges? And school children?

Chris Rusbridge highlighted such complexities: “The point is, these things are hard. Understanding your data structures and their semantics, understanding the actual data and their provenance, understanding your questions, expressing them really clearly: these are hard things.” Chris concluded “I’m beginning to worry that Linked Data may be slightly dangerous except for very well-designed systems and very smart people…” Chris probably had his tongue in his cheek with his ‘smart people’ remark, but he may be right with his warning that Linked Data might be dangerous. If a simple query such as “Which town or city in the UK has the largest proportion of students?” is open to a number of different interpretations, what are the implications for more complex queries?

In my “Response To My Linked Data Challenge” I described how Tim Berners-Lee introduced the Semantic Web by describing how it aimed to provide an answer to a query such as “Is there a green car for sale for around $15000 in Queensland?”. Tim described how, unlike the search engines of the day, a Semantic Web query would be able to find a result which was described as “Affordable maroon saloon for sale in Brisbane”. But this query is seeking to find additional results which would not be found by a traditional keyword search. The query “Which town or city in the UK has the largest proportion of students?”, however, is seeking to find a single answer. Might there be types of queries for which Linked Data might work and others for which it may be difficult or expensive to model the data? Or, to rephrase the question: what, specifically, is Linked Data for?