My recent invitation to Linked Data developers to illustrate the potential benefits of Linked Data by providing an answer to a simple query using DBpedia as a data source generated a lot of subsequent discussion. A tweet by Frank van Harmelen (the Dutch Computer Scientist and Professor in Knowledge Representation & Reasoning, I assume) summarised his thoughts on the two posts and the related behind-the-scenes activities: “insightful discussion on #linkeddate strengths, weaknesses, scope and limitations”.

But as described in the post, the answer to the question “Which town or city in the UK has the largest proportion of students?” was clearly wrong.  And if you view the output from the most recent version of the query, you’ll see that the answers are still clearly incorrect.

We might regard this ‘quick fail’ as being of more value than the ‘quick win’ which I had expected initially, as it provides an opportunity to reflect on the processes needed to debug a Linked Data query.

As a reminder here is the query:

#quick attempt at analyzing students as % of population in the United Kingdom by Town
#this query shows DBpedia extraction related quality issues which ultimately are a function of the
#wikipedia infoboxes.

prefix dbpedia-owl:
prefix dbpedia-owl-uni:
prefix dbpedia-owl-inst:

select distinct  ?town ?pgrad ?ugrad  ?population (((?pgrad + ?ugrad) / 1000.0 / ?population ) ) as ?per where {
?s dbpedia-owl-inst:country dbpedia:United_Kingdom;
   dbpedia-owl-uni:postgrad ?pgrad;
   dbpedia-owl-uni:undergrad ?ugrad;
   dbpedia-owl-inst:city ?town.
optional {?town dbpedia-owl:populationTotal ?population. filter (?population >0) }
}
group by ?town having (((?pgrad + ?ugrad) / 1000.0 / ?population ) ) > 0
order by desc 5

As can be seen, the query is short and, for a database developer with SQL expertise, the program logic should be apparent. But the point about Linked Data is the emphasis on the data and the way in which the data is described (using RDF). So I suspect there will be a need to debug the data as well as the query. We will probably need answers to questions such as “Is the data correct in the original source (Wikipedia)?”; “Is the data correct in DBpedia?”; “Is the data marked up in a consistent fashion?”; “Does the query process the data correctly?” and “Does the data reflect the assumptions in the query?”.
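The question “Does the query process the data correctly?” can be explored by mirroring the arithmetic of the select expression outside SPARQL. The Python below is only a sketch: the function name `student_share` and all of the student and population figures are hypothetical, chosen for illustration, and only the expression itself is taken from the query.

```python
def student_share(pgrad, ugrad, population):
    """Mirror the query's expression: (?pgrad + ?ugrad) / 1000.0 / ?population."""
    if population <= 0:
        # The query filters out non-positive populations, so do the same here.
        return None
    return (pgrad + ugrad) / 1000.0 / population


# Hypothetical figures, for illustration only (not taken from DBpedia).
students = {"pgrad": 5000, "ugrad": 15000}

plausible = student_share(**students, population=100000)
implausible = student_share(**students, population=10)

print(plausible)
print(implausible)
```

With identical student counts, the result differs by four orders of magnitude depending only on the population value, which suggests that a single badly extracted population figure would be enough to push a town to the top of the table.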

Finding an answer to these questions might be best done by looking at the data for the results which were clearly in error and comparing the data with results which appear to be more realistic.

We see that Cambridge has a population of 12 and Oxford a population of 38. These are clearly wrong. My initial suspicion was that several zeros were missing (perhaps the data was described in Wikipedia as population in tens of thousands). But looking at the other end of the table, the towns and cities with the largest populations include Chatham (Kent) with a population of 70,540, Stirling (41,243) and Guildford (66,773); the latter population count agrees with the data held in Wikipedia.
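A first debugging step could be a simple plausibility check over the returned population values. The sketch below uses the five figures quoted above; the threshold of 1,000 and the name `suspicious_towns` are my own assumptions for illustration, not anything defined by DBpedia.

```python
# Population values as reported in the query results discussed in this post.
reported_population = {
    "Cambridge": 12,
    "Oxford": 38,
    "Chatham": 70540,
    "Stirling": 41243,
    "Guildford": 66773,
}

# Hypothetical threshold: a UK university town with fewer than 1,000
# inhabitants is assumed to indicate a data problem rather than a real value.
MIN_PLAUSIBLE_POPULATION = 1000


def suspicious_towns(populations, minimum=MIN_PLAUSIBLE_POPULATION):
    """Return the towns whose reported population falls below the threshold."""
    return sorted(town for town, count in populations.items() if count < minimum)


print(suspicious_towns(reported_population))
```

A check of this kind only flags candidates for inspection; it cannot say whether the fault lies in Wikipedia, in the DBpedia extraction or in the query.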

In addition to the strange population figures, there are also questions about the towns and cities which are described as hosting a UK university. As far as I know neither Folkestone nor Hastings has a university. London, however, has many universities but is missing from the list.

My supposition is that the population data is marked up in a variety of ways. Looking at the Wikipedia entry for Cambridge, for example, I see that the info table on the right of the page (which contains the information used in DBpedia) has three population counts: the district and city population (122,800), the urban population (130,000) and the county population (752,900). But querying DBpedia returns three different values for population: 12, 73 and 752,900. Only the county figure has survived intact.
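The mismatch for Cambridge can be laid out side by side. The following sketch compares the three Wikipedia figures with the three DBpedia values quoted above; the helper names `unmatched` and `prefix_candidates` are hypothetical, and the leading-digits test is just one crude way of probing whether a small value might be a cut-off form of a larger one, not a statement about how DBpedia actually extracts data.

```python
# Figures quoted in this post for Cambridge.
wikipedia_infobox = {
    "district and city": 122800,
    "urban": 130000,
    "county": 752900,
}
dbpedia_values = [12, 73, 752900]


def unmatched(extracted, source_figures):
    """Return extracted values that do not appear among the source figures."""
    return [value for value in extracted if value not in source_figures]


def prefix_candidates(value, source_figures):
    """Return source figures whose leading digits equal the extracted value."""
    return [
        figure
        for figure in source_figures
        if figure != value and str(figure).startswith(str(value))
    ]


source = list(wikipedia_infobox.values())
odd_values = unmatched(dbpedia_values, source)

print(odd_values)
for value in odd_values:
    print(value, prefix_candidates(value, source))
```

Here 12 matches the leading digits of 122,800, while 73 matches none of the three figures, so a single explanation is unlikely to account for both values.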

The confusion regarding towns and cities which may or may not host UK universities might reflect real-world complexities: if a town hosts a campus but the main campus is located elsewhere, should the town be included? There’s not a clear-cut answer, especially when, as in this case, the data, from Wikipedia, is managed in a very devolved fashion.

I’ve suggested some possible reasons for the incorrect results from the SPARQL query and I am sure there may be additional reasons (and I welcome such suggestions). How one might go about fixing the bugs is another question. Should the data be made more consistent? If so, how might one do this when the data is owned by a distributed community? Or isn’t the point of Linked Data that the data should be self-describing, in which case perhaps a much more complex SPARQL query is needed in order to process the complexities hidden behind my apparently simple question?