As I mentioned before, I was struggling to set up a functioning SPARQL endpoint for the Wikibase installation at librarybase.wmflabs.org. That struggle now appears to be at an end: the endpoint has been running (basically) successfully for over a week.
It turns out the reason the updater was failing is quite complex and interesting. Below I describe what I believe the problem was.
Fundamentally, the problem stemmed from the SPARQL endpoint returning unexpected responses to queries similar to this:
SELECT DISTINCT ?s WHERE {
  VALUES (?s ?rev) { ( <http://librarybase.wmflabs.org/entity/Q64433> 4325 ) }
  OPTIONAL { ?s <http://schema.org/version> ?repoRev }
  FILTER (!bound(?repoRev) || ?repoRev < ?rev)
}
The idea behind this query is to test each (URI, integer) pair in the VALUES section against the criteria in the FILTER block: does the endpoint's triple store hold a version number for the given URI, and if so, is that version less than the integer provided? If the stored version is lower, or no version exists at all, the query should return the URI. Given a list of pairs, it should return the list of URIs meeting this criterion.
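As an aside, you can inspect what the triple store actually holds for an entity by querying the version triple directly. A minimal sketch, reusing the entity URI from the query above:

  SELECT ?repoRev WHERE {
    <http://librarybase.wmflabs.org/entity/Q64433> <http://schema.org/version> ?repoRev
  }

If this returns no rows, the store has never seen the entity, which is exactly the "doesn't exist at all" case the FILTER is meant to catch.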
This is how the updater identified entities for which new data was available. However, the endpoint was actually returning a large list of URIs. This was because it treated the URI in the VALUES section as UNDEF if it wasn't already in the triple store. It therefore only tried to match the version integer part of the VALUES section, and that could match many subjects, including some that were not of the form http://librarybase.wmflabs.org/entity/Q123. Because the data returned from this query was not checked for validity, the updater tried to perform changes on objects which don't exist (hence the NullPointerException I described earlier); I think this is because it determines what needs changing based on simple string operations on the URI(s) it gets back.
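To illustrate, here is a sketch of what the query effectively becomes once the engine has replaced the unknown URI with UNDEF (this is the failure mode, not the updater's actual query):

  SELECT DISTINCT ?s WHERE {
    VALUES (?s ?rev) { ( UNDEF 4325 ) }
    OPTIONAL { ?s <http://schema.org/version> ?repoRev }
    FILTER (!bound(?repoRev) || ?repoRev < ?rev)
  }

With ?s unbound, the OPTIONAL pattern is free to match every subject in the store that has a schema:version value, and the FILTER then keeps any of them whose version is below 4325. Hence the long list of unrelated URIs.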
The next question is why the query engine treated URIs not in the triple store as UNDEF. It didn't seem to be a problem for an endpoint running the same code at http://query.wikidata.org (the WDQS); that was until I started playing around with the query and submitted a similar one. In order to test for an entity I knew did not exist, I entered a keymash of digits for its ID. Luckily, I accidentally also struck the q key in the process (e.g. <http://www.wikidata.org/entity/Q12345q12345>). This gave me a similar result from WDQS: a long list of URIs I hadn't put in my VALUES section. Of course this is an invalid ID, and if you follow it you get HTTP error 400 (Bad Request).
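For reference, my probe was the same shape of query with the keymashed ID substituted in (a reconstruction; the revision number is arbitrary):

  SELECT DISTINCT ?s WHERE {
    VALUES (?s ?rev) { ( <http://www.wikidata.org/entity/Q12345q12345> 1 ) }
    OPTIONAL { ?s <http://schema.org/version> ?repoRev }
    FILTER (!bound(?repoRev) || ?repoRev < ?rev)
  }

Instead of returning just the probe URI (which has no version triple), WDQS returned a long list of unrelated URIs, just as Librarybase had.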
This meant that somehow the endpoint (which I assumed was isolated from the world) 'knew' which entities were good and which were bad. I realised that on Librarybase (as with any vanilla Wikibase installation) www.site.org/entity/Q123 didn't actually go anywhere; it only works on www.wikidata.org and test.wikidata.org because there is a URL rewrite in place that takes you to https://www.wikidata.org/wiki/Special:EntityData/Q123.
I fixed this by putting a rewrite in place on the Librarybase webserver, but the problem still existed. I then approached Stas Malyshev (the main author of the WDQS endpoint code) with this information, and he realised there was a chance that some hardcoded references to Wikidata URIs existed in the endpoint itself rather than in the updater (in /common/src/main/java/org/wikidata/query/rdf/common/uri/WikibaseUris.java, for those trying to make it work for themselves). Changing these to librarybase.wmflabs.org and rebuilding meant everything suddenly started working perfectly!
All in all, a rather complicated puzzle, but one I was very thankful to have solved so I could move on with importing data into Librarybase.