
Fetch, Nepomuk, fetch!

Monday, 11 February 2008  |  trueg

Search - a very important topic when it comes to data in general. The same is true for metadata and all that is Nepomuk. I blogged about the virtual folders idea for KMail, which will be realized through Nepomuk. But before that there is the "simple" desktop search. We know it from systems like Beagle or Strigi. With Nepomuk, however, a lot more is possible. We are just getting started.

Let me give a quick glance at what I am doing regarding search. Now that Strigi analyzes files and the Nepomuk extensions to Dolphin allow tagging and commenting files, we surely want to reuse that information. On the list of simple ways to exploit the data in the Nepomuk store, search is No. 2 (No. 1 being a simple display of it). We want the desktop search to handle manual metadata like tags and automatically gathered metadata alike.

Well, that is possible and I am doing it already in playground:

[image:3273 align=middle size=preview]

Now isn't that nice? You can combine searching for tags with other metadata searches. So far so good. It gets better: Nepomuk is based on RDF/S/NRL ontologies. Thus, each metadata type and field is defined by an RDF resource. In most cases (for example Xesam) these come with proper rdfs:label definitions. Thus, Nepomuk can not only group the results automatically (see the File, Image, or Music groups) but can also handle search fields generically. What does that mean? Well, it means that when searching for "hastag:nepomuk", "hastag" will be matched to nao:hasTag automatically. The same would be true for "tag", since we are doing a full-text search on the field names.

And even better: if the ontologies are translated (RDF supports language tags after all), you can search the same fields using your native language and the results will be grouped in your native language (I could use some help setting up a translation system like the one for desktop files here). It all happens generically, without any hardcoded mapping. Pretty cool, isn't it?
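
To give a rough idea of what those language tags look like at the query level, here is a hypothetical little snippet (not code from the search client, just a sketch reusing the same Soprano model and calls as in the snippets below) that fetches the German label of nao:hasTag:

// Hypothetical sketch: ask the store for the German label of nao:hasTag
QString query = QString( "select ?label where { "
                         "<http://www.semanticdesktop.org/ontologies/2007/08/15/nao#hasTag> <%1> ?label . "
                         "FILTER(LANG(?label) = 'de') . }" )
                    .arg( Soprano::Vocabulary::RDFS::label().toString() );
Soprano::QueryResultIterator it = model->executeQuery( query, Soprano::Query::QueryLanguageSparql );
if ( it.next() )
    qDebug() << "German label of nao:hasTag:" << it.binding( "label" ).literal().toString();

If a translated ontology provides such a label, the generic matching described above simply picks it up.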

OK, so much for the outer shell. Let's dive into the code a bit more. (But please keep in mind that I have plans to wrap this into a nice search service soon, which will allow most application developers to perform their simple day-to-day queries without knowing much SPARQL.)

If we want to find the proper field to match in a field:value query, we can do the following:

// Find all properties (i.e. candidate search fields) whose rdfs:label
// matches the field string exactly
QString field = getFieldNameWhateverFooBarBlaBla();
QString query = QString( "select ?p where { "
                         "?p <%1> <%2> . "
                         "?p <%3> \"%4\"^^<%5> . }" )
                    .arg( Soprano::Vocabulary::RDF::type().toString() )
                    .arg( Soprano::Vocabulary::RDF::Property().toString() )
                    .arg( Soprano::Vocabulary::RDFS::label().toString() )
                    .arg( field )
                    .arg( Soprano::Vocabulary::XMLSchema::string().toString() );
Soprano::QueryResultIterator labelHits = model->executeQuery( query, Soprano::Query::QueryLanguageSparql );

This will give us all direct hits for a property (field) label. However, in most cases users will enter a slight variation of the actual label. Thus, we use a more fuzzy search:

// Fuzzy variant: match the field string case-insensitively against
// all property labels instead of requiring an exact match
QString query = QString( "select ?p where { "
                         "?p <%1> <%2> . "
                         "?p <%3> ?label . "
                         "FILTER(REGEX(STR(?label),'%4','i')) . }" )
                    .arg( Soprano::Vocabulary::RDF::type().toString() )
                    .arg( Soprano::Vocabulary::RDF::Property().toString() )
                    .arg( Soprano::Vocabulary::RDFS::label().toString() )
                    .arg( field );

The FILTER simply keeps all properties whose label matches our field string (as a case-insensitive substring).
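
Getting the actual property URI out of the result iterator is then straightforward. A minimal sketch (assuming the query and model from above; in the real client one would of course prefer an exact label hit over a fuzzy one):

// Run the fuzzy query and take the first matching property as our field URI.
// (A real implementation would rank exact label hits from the first query higher.)
Soprano::QueryResultIterator hits = model->executeQuery( query, Soprano::Query::QueryLanguageSparql );
QUrl fieldUri;
if ( hits.next() )
    fieldUri = hits.binding( "p" ).uri();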

And then it gets a bit tricky, as there is one problem left in Soprano: the RDF storage solutions we use (Redland or Sesame2) do not have performant full-text search indexes. Thus, for Soprano I implemented a wrapper that uses a CLucene index to provide a fast full-text index on all literal RDF triples (the Nepomuk server already uses it, so there is no need to instantiate it on the client side). I have plans to hide this transparently behind a nice Soprano query API, but so far we do not have that. As a result we have to perform full-text queries and "normal" SPARQL queries separately (as always, I need help implementing this).

Let's say we got a field URI from our previous search and stored it in fieldUri.

// Query the CLucene full-text index through Soprano's user query
// language "lucene" using a "<fieldUri>:<value>" query string
QString value = getSearchValueWhateverFooBarBlaBla();
Soprano::QueryResultIterator hits = model->executeQuery( fieldUri.toString() + ':' + value,
                                                         Soprano::Query::QueryLanguageUser,
                                                         "lucene" );

And as a result we get all the resources that match the query.
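
Iterating over those hits works just like with the plain SPARQL queries. A rough sketch; note that the binding names below are an assumption on my side, so check the Soprano index filter model for the exact names:

// Walk the full-text hits (binding names assumed here: the index filter
// model exposes the matching resource and its relevance score).
while ( hits.next() ) {
    QUrl resource = hits.binding( "resource" ).uri();
    double score = hits.binding( "score" ).literal().toDouble();
    // 'resource' is now ready to be displayed or combined with the
    // results of a normal SPARQL query.
}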

This is just a small excerpt of what I am doing in the search client and what will soon be done in the search service, but it should give you an idea of how things need to be done ATM. More complex queries are of course possible, but this blog entry is already too long as it is. ;)