A Ruby Plasma Data Engine based on DBPedia SPARQL queries

Thursday, 17 April 2008 | richard dale

I've been playing with using KIO::get() to make queries on the DBPedia SPARQL endpoint, parse the XML result set and convert it to be used by a Plasma Data Engine. I'll explain how it works as I think it is pretty useful and makes it very easy to link up applets with Semantic Web/Desktop data.

This is the basic SPARQL query. It takes the name of an artist and retrieves details of all the albums they've made: the album name, the URI of the album's DBPedia resource, the release date and the cover art picture:

PREFIX p: <http://dbpedia.org/property/>  
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT * WHERE { 
     ?album p:artist  <http://dbpedia.org/resource/The_Velvet_Underground>.       
     ?album rdf:type <http://dbpedia.org/class/yago/Album106591815>.
     OPTIONAL {?album p:cover ?cover}.
     OPTIONAL {?album p:name ?name}.
     OPTIONAL {?album p:released ?dateofrelease}.
   }

I borrowed the example query from this article about making a timeline of albums. You send the query string to the DBPedia SPARQL endpoint at http://dbpedia.org/sparql, and the query results are returned in a simple-to-parse XML format. They look like this:

<?xml version="1.0" ?>
<sparql xmlns="http://www.w3.org/2005/sparql-results#"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="http://www.w3.org/2001/sw/DataAccess/rf1/result2.xsd">
 <head>
  <variable name="album"/>
  <variable name="cover"/>
  <variable name="name"/>
  <variable name="dateofrelease"/>
 </head>
 <results distinct="false" ordered="true">
  <result>
   <binding name="album">
<uri>http://dbpedia.org/resource/1969:_The_Velvet_Underground_Live</uri></binding>
   <binding name="cover">
<uri>http://upload.wikimedia.org/wikipedia/en/3/3c/1969Live.jpg</uri></binding>
   <binding name="name"><literal xml:lang="en">1969: The Velvet Underground Live</literal></binding>
   <binding name="dateofrelease"><literal datatype=
   "http://www.w3.org/2001/XMLSchema#gYearMonth">1974-09-01 00:00:00.000000</literal></binding>
  </result>
  ...
 </results>
</sparql>
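
As a rough illustration of how such a request URL is put together, here is a minimal sketch using only the Ruby standard library. The helper name sparql_query_url is made up for this example, and URI.encode_www_form_component is used here as a stand-in for the CGI.escape call in the engine code; both turn the query text into a URL-escaped 'query' parameter.

```ruby
require 'uri'

# Hypothetical helper (not part of any library): builds the GET url for a
# SPARQL endpoint by URL-escaping the query text into a 'query' parameter.
def sparql_query_url(endpoint, query)
  "#{endpoint}?query=#{URI.encode_www_form_component(query)}"
end

# A deliberately tiny query, just to show the escaping.
query = 'SELECT * WHERE { ?album ?p ?o } LIMIT 1'
url = sparql_query_url('http://dbpedia.org/sparql', query)
puts url
```

The escaped URL contains no spaces or braces, so it can be handed to whatever HTTP client you prefer.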

So if you're familiar with SQL queries, a SPARQL select query is very similar. To make it work well with the Plasma Data Engine model you need to decide which of the values is the most important, and in this case it's the album name.

The code to issue an HTTP request via KIO::get() is really short and simple. I wrote about using ActiveRDF to query in Get Semantic with DBPedia and ActiveRDF, and it was an interesting idea but didn't work very well. The open-uri get() call that the ActiveRDF SPARQL adapter uses would keep timing out even if you simplified the queries, and it was synchronous, which meant that a GUI app would just freeze while the query was being executed. KIO just chugs away in the background, calling the queryData() slot whenever some data arrives, until it calls the queryCompleted() slot and the data is ready to parse.

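# Standard library dependencies used below; the Plasma, KIO and Soprano
# Ruby bindings must be loaded as well.
require 'cgi'
require 'rexml/document'

class SparqlDataEngine < Plasma::DataEngine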
  slots 'queryData(KIO::Job*, QByteArray)',
        'queryCompleted(KJob*)'

  def initialize(parent, args, endpoint, query, primary_value)
    super(parent)
    setMinimumPollingInterval(120 * 1000)
    @endpoint = endpoint
    @query = query
    @primary_value = primary_value
  end

  def sourceRequestEvent(source_name)
    if @job
      return false
    end

    @source_name = source_name
    @sparql_results_xml = ""
    query_url = KDE::Url.new("#{@endpoint}?query=#{CGI.escape(@query % @source_name.gsub(' ', '_'))}")
    @job = KIO::get(query_url, KIO::Reload, KIO::HideProgressInfo)
    @job.addMetaData("accept", "application/sparql-results+xml" )
    connect(@job, SIGNAL('data(KIO::Job*, QByteArray)'), self,
            SLOT('queryData(KIO::Job*, QByteArray)'))
    connect(@job, SIGNAL('result(KJob*)'), self, SLOT('queryCompleted(KJob*)'))
    setData(@source_name, {})
    return true
  end

  def queryData(job, data)
    @sparql_results_xml += data.to_s
  end

  def queryCompleted(job)
    @job.doKill
    @job = nil
    parser = SparqlResultParser.new
    REXML::Document.parse_stream(@sparql_results_xml, parser)
    parser.result.each do |binding|
      binding.each_pair do |key, value|
        # puts "#{key} --> #{value.inspect}"
        setData(binding[@primary_value].literal.variant.toString, key, Qt::Variant.fromValue(value))
      end
    end
  end

  def updateSourceEvent(source_name)
    sourceRequestEvent(source_name)
    return true
  end
end

I tweaked the XML parsing code in the ActiveRDF adapter to create Nepomuk Soprano nodes, and return a Ruby Array of Hashes, each hash having keys for the SPARQL query variables and Soprano::Nodes for the values. The code in the 'queryCompleted()' method above then walks through the results making Plasma setData() calls, which is how an engine submits its data. The first argument of the setData() call is the album name, e.g. 'White Light/White Heat' for the Velvets, the second is the particular attribute, such as date of release, and the third is the Soprano::Node with the value wrapped up in a Qt::Variant.
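
To make that regrouping concrete, here is a plain-Ruby sketch with no KDE bindings involved. Strings stand in for the Soprano::Nodes and a nested Hash stands in for the setData() calls; the method name group_by_primary is invented for this example, and skipping rows without a primary value is an assumption of the sketch rather than something the engine above does.

```ruby
# Regroup an Array of result Hashes by their primary value, the way
# queryCompleted() does with setData(source, key, value).
def group_by_primary(rows, primary_key)
  sources = Hash.new { |hash, name| hash[name] = {} }
  rows.each do |row|
    source = row[primary_key]
    # Assumption of this sketch: rows whose OPTIONAL primary binding is
    # missing are skipped rather than raising an error.
    next if source.nil?
    row.each_pair { |key, value| sources[source][key] = value }
  end
  sources
end

# Values taken from the sample result set earlier in the post.
rows = [
  { 'name' => '1969: The Velvet Underground Live',
    'dateofrelease' => '1974-09-01 00:00:00.000000' },
  { 'album' => 'http://dbpedia.org/resource/1969:_The_Velvet_Underground_Live' }
]

grouped = group_by_primary(rows, 'name')
p grouped
```

Each top-level key corresponds to one Plasma source, and each inner key to one attribute an applet can read from it.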

This is the code that parses the XML using the Ruby REXML library:

# Parser for SPARQL XML result set. Derived from the parser in the
# ActiveRDF SPARQL adapter code. Produces an Array of Hashes, each
# hash contains keys for each of the variables in the query, and
# values which are Soprano nodes.
#
class SparqlResultParser
  attr_reader :result

  def initialize
    @result = []
    @vars = []
    @current_type = nil
  end
  
  def tag_start(name, attrs)
    case name
    when 'variable'
      @vars << attrs['name']
    when 'result'
      @current_result = {}
    when 'binding'
      @current_binding = attrs['name']
    when 'bnode', 'uri'
      @current_type = name
    when 'literal', 'typed-literal'
      @current_type = name
      @datatype = attrs['datatype']
      @xmllang = attrs['xml:lang']
    end
  end
  
  def tag_end(name)
    if name == "result"
      @result << @current_result
    elsif name == 'bnode' || name == 'literal' || name == 'typed-literal' || name == 'uri'
      @current_type = nil
    elsif name == "sparql"
    end
  end
  
  def text(text)
    if !@current_type.nil?
      @current_result[@current_binding] = create_node(@current_type, @datatype, @xmllang, text)
    end
  end

  # create ruby objects for each RDF node
  def create_node(type, datatype, xmllang, value)
    case type
    when 'uri'
      Soprano::Node.new(Qt::Url.new(value))
    when 'bnode'
      Soprano::Node.new(value)
    when 'literal', 'typed-literal'
      if xmllang
        Soprano::Node.new(Soprano::LiteralValue.new(value), xmllang)
      elsif datatype
        Soprano::Node.new(Soprano::LiteralValue.fromString(value, Qt::Url.new(datatype)))
      else
        Soprano::Node.new(Soprano::LiteralValue.new(value))
      end
    end
  end
  
  # Silently ignore any other REXML stream events (xmldecl, comment, etc.)
  def method_missing(*args)
  end
end
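
If you want to try the stream parsing idea without the Soprano bindings installed, a cut-down version might look like the following. This is only a sketch: PlainResultListener is a made-up name, it stores every binding as a plain String instead of a Soprano::Node, and it includes REXML::StreamListener to get no-op versions of the unused callbacks rather than using method_missing.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Simplified stand-in for SparqlResultParser: same REXML stream callbacks,
# but each binding value is kept as a plain String.
class PlainResultListener
  include REXML::StreamListener  # supplies no-op versions of unused callbacks
  attr_reader :result

  def initialize
    @result = []
    @in_value = false
  end

  def tag_start(name, attrs)
    case name
    when 'result' then @current_result = {}
    when 'binding' then @current_binding = attrs['name']
    when 'uri', 'bnode', 'literal' then @in_value = true
    end
  end

  def tag_end(name)
    case name
    when 'result' then @result << @current_result
    when 'uri', 'bnode', 'literal' then @in_value = false
    end
  end

  # Only text inside a value element is recorded; whitespace between
  # tags arrives here too and is ignored.
  def text(text)
    @current_result[@current_binding] = text if @in_value
  end
end

# Shortened copy of the sample result set shown earlier.
xml = <<~XML
  <sparql xmlns="http://www.w3.org/2005/sparql-results#">
   <results>
    <result>
     <binding name="name"><literal xml:lang="en">1969: The Velvet Underground Live</literal></binding>
     <binding name="cover"><uri>http://upload.wikimedia.org/wikipedia/en/3/3c/1969Live.jpg</uri></binding>
    </result>
   </results>
  </sparql>
XML

listener = PlainResultListener.new
REXML::Document.parse_stream(xml, listener)
p listener.result
```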

Those two classes are pretty generic and could be used for any similar SPARQL query. You just need to subclass SparqlDataEngine to give it a specific query string and endpoint, like this:

SPARQL_QUERY = <<-EOS
PREFIX p: <http://dbpedia.org/property/>  
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT * WHERE { 
     ?album p:artist  <http://dbpedia.org/resource/%s>.       
     ?album rdf:type <http://dbpedia.org/class/yago/Album106591815>.
     OPTIONAL {?album p:cover ?cover}.
     OPTIONAL {?album p:name ?name}.
     OPTIONAL {?album p:released ?dateofrelease}.
   }
EOS

#
# Customize the use of the SparqlDataEngine by giving it the url of an endpoint,
# a query to execute, and the name of the most important (or primary) value.
# The '%s' in the query text above is replaced with the source name, with any
# spaces replaced by underscores.
#
class DbpediaAlbumsEngine < SparqlDataEngine
  def initialize(parent, args)
    super(parent, args, 'http://dbpedia.org/sparql', SPARQL_QUERY, 'name')
  end
end
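
The '%s' substitution can be tried on its own in plain Ruby. The sketch below uses a cut-down template and a helper named query_for_source, both invented for this example, to show how a source name becomes a DBPedia resource name.

```ruby
# Cut-down stand-in for the query template above, with a single '%s' slot.
QUERY_TEMPLATE =
  'SELECT * WHERE { ?album p:artist <http://dbpedia.org/resource/%s> }'

# Hypothetical helper: replace spaces with underscores, then fill the
# '%s' slot, mirroring what sourceRequestEvent() does before escaping.
def query_for_source(template, source_name)
  template % source_name.gsub(' ', '_')
end

filled = query_for_source(QUERY_TEMPLATE, 'The Velvet Underground')
puts filled
```

One thing to watch with this approach is that any literal '%' in the query text would need to be written as '%%', since the whole template goes through Ruby's format operator.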

It's very little work indeed compared with the way you normally have to issue standard HTTP requests and then parse the totally non-standard results. I had a look at some of the Weather applet's Ion code to get BBC forecasts and it was really very complicated; it would be vastly simpler if you could get weather data via SPARQL instead. The last step is to make a .desktop file for your new engine:

[Desktop Entry]
Name=DBPedia Albums Data Engine
Comment=DBPedia album data for Plasmoids
X-KDE-ServiceTypes=Plasma/DataEngine
Type=Service
Icon=
X-KDE-Library=krubypluginfactory
X-KDE-PluginKeyword=plasma-engine-dbpedia-albums/dbpedia_albums_engine.rb
X-Plasma-EngineName=dbpedia-albums

And a simple CMakeLists.txt file to install it:

install(FILES plasma-dataengine-dbpedia-albums.desktop DESTINATION ${SERVICES_INSTALL_DIR} )
install(FILES dbpedia_albums_engine.rb DESTINATION ${DATA_INSTALL_DIR}/plasma-engine-dbpedia-albums)

You can use the Plasma engine explorer to test engines, and I enhanced the Ruby version slightly so it can show the contents of Soprano::Nodes within Qt::Variants. Here is what the browser looks like testing a new engine:

[image:3401 size=preview]

I'll try and add some stuff to the TechBase wiki about writing Ruby Plasma data engines and applets once the API has settled down a bit again, but I hope I've explained enough to get people playing with SPARQL queries, as I think there could be a lot of applications for the idea.