Fetch, Nepomuk, fetch!

Search is a very important topic when it comes to data in general. The same is true for metadata and all that is Nepomuk. I blogged about the virtual folder idea for KMail, which will be realized through Nepomuk. But before that there is "simple" desktop search. We know it from systems like Beagle or Strigi. With Nepomuk, however, a lot more is possible. We are just getting started.

Let me give a quick glance at what I am doing regarding search. Now that Strigi analyzes files and the Nepomuk extensions to Dolphin allow tagging and commenting on files, we surely want to reuse that information. On the list of simple ways to exploit the data in the Nepomuk store, search is No. 2 (No. 1 being a simple display of it). We want desktop search to handle manual metadata like tags and automatically gathered metadata alike.

Well, that is possible and I am doing it already in playground:

[image:3273 align=middle size=preview]

Now isn't that nice? You can combine searching for tags with other metadata searches. So far so good. It gets better: Nepomuk is based on RDF/S/NRL ontologies. Thus, each metadata type and field is defined by an RDF resource. In most cases (for example Xesam) these come with proper rdfs:label definitions. Thus, Nepomuk can not only group the results automatically (see the File, Image, or Music groups) but can also handle search fields generically. What does that mean? Well, it means that when searching for "hastag:nepomuk", "hastag" will be matched to nao:hasTag automatically. The same would be true for "tag", as we are doing a full-text search on field names. And even better: if the ontologies are translated (RDF supports language tags after all) you can search the same fields using your native language, and the results will be grouped in your native language (I could use some help setting up a translation system like the one for desktop files here). It all happens generically, without any hardcoded mapping. Pretty cool, isn't it?
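To illustrate the field:value matching described above, splitting a query term at the first colon might look like the following. This is a minimal sketch in plain C++; splitTerm is a hypothetical helper name, not the actual Nepomuk parser:

```cpp
#include <string>
#include <utility>

// Split a term like "hastag:nepomuk" into a field name ("hastag") and a
// value ("nepomuk"). A term without a colon is treated as a pure
// full-text value with an empty field name.
std::pair<std::string, std::string> splitTerm(const std::string& term)
{
    const std::string::size_type pos = term.find(':');
    if (pos == std::string::npos)
        return std::make_pair(std::string(), term);
    return std::make_pair(term.substr(0, pos), term.substr(pos + 1));
}
```

The field half is what gets matched against the rdfs:label values; the value half goes into the actual search.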

OK, so much for the outer shell. Let's dive into the code for a bit. (But please keep in mind that I have plans to wrap this into a nice search service soon which allows most application developers to perform their simple day-to-day queries without knowing much SPARQL.)

If we want to find the proper field to match in a field:value query we can do as follows:

// Find properties whose rdfs:label exactly matches the field string.
QString field = getFieldNameWhateverFooBarBlaBla();
QString query = QString( "select ?p where { "
                         "?p <%1> <%2> . "
                         "?p <%3> \"%4\"^^<%5> . }" )
                    .arg( Soprano::Vocabulary::RDF::type().toString() )
                    .arg( Soprano::Vocabulary::RDF::Property().toString() )
                    .arg( Soprano::Vocabulary::RDFS::label().toString() )
                    .arg( field )
                    .arg( Soprano::Vocabulary::XMLSchema::string().toString() );
Soprano::QueryResultIterator labelHits = model->executeQuery( query, Soprano::Query::QueryLanguageSparql );

This will give us all direct hits for a property (field) label. However, in most cases users will enter a slight variation of the actual label. Thus, we use a fuzzier search:

// Find properties whose rdfs:label contains the field string
// (case-insensitive).
QString query = QString( "select ?p where { "
                         "?p <%1> <%2> . "
                         "?p <%3> ?label . "
                         "FILTER(REGEX(STR(?label),'%4','i')) . }" )
                    .arg( Soprano::Vocabulary::RDF::type().toString() )
                    .arg( Soprano::Vocabulary::RDF::Property().toString() )
                    .arg( Soprano::Vocabulary::RDFS::label().toString() )
                    .arg( field );

The regular expression simply filters all properties with a label that matches our field string.
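One caveat with the REGEX approach: if the user-entered field string contains regular-expression metacharacters ("." or "+", for example), they will be interpreted as pattern syntax rather than literal text. A minimal sketch of escaping them first, in plain C++ (escapeRegex is a hypothetical helper name):

```cpp
#include <string>

// Prefix every regex metacharacter with a backslash so the user input
// is matched literally inside REGEX().
std::string escapeRegex(const std::string& s)
{
    static const std::string meta = "\\^$.|?*+()[]{}";
    std::string out;
    out.reserve(s.size());
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        if (meta.find(s[i]) != std::string::npos)
            out += '\\';
        out += s[i];
    }
    return out;
}
```

The escaped string would then be substituted into the FILTER instead of the raw field string.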

And then it gets a bit tricky, as there is one problem left in Soprano: the RDF storage backends we use (Redland or Sesame2) do not provide performant full-text search indexes. Thus, for Soprano I implemented a wrapper that uses a CLucene index to provide a fast full-text index over all literal RDF triples (the Nepomuk server already uses it, so there is no need to instantiate it on the client side). I have plans to hide this transparently behind a nice Soprano query API, but so far we do not have that. As a result we have to perform full-text queries and "normal" SPARQL queries separately (as always, I need help implementing this).

Let's say we got a field URI from our previous search and stored it in fieldUri.

QString value = getSearchValueWhateverFooBarBlaBla();
Soprano::QueryResultIterator hits = model->executeQuery( fieldUri.toString() + ':' + value,
                                                         Soprano::Query::QueryLanguageUser,
                                                         "lucene" );

And as a result we get all the resources that match the query.
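A similar caveat applies here: Lucene's query syntax treats characters like "+", "*", "?" and ":" specially, so a literal user value should be escaped before it is appended to the query. A hedged sketch in plain C++ (escapeLucene is a hypothetical helper name, and the exact special-character set may vary between Lucene versions):

```cpp
#include <string>

// Backslash-escape the characters that are special in Lucene query
// syntax so the value is searched for literally.
std::string escapeLucene(const std::string& s)
{
    static const std::string special = "+-!(){}[]^\"~*?:\\&|";
    std::string out;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        if (special.find(s[i]) != std::string::npos)
            out += '\\';
        out += s[i];
    }
    return out;
}
```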

This is just a small excerpt of what I am doing in the search client and what will soon be done in the search service but it should give you an idea of how things need to be done ATM. More complex queries are of course possible but the blog entry is already too long as it is. ;)


It would be great to get apps like Amarok and Digikam to integrate their tagging and rating systems with this, which I'm sure will be possible, so, well, swing into action, code monkeys! :-)

By Tom Chance at Mon, 02/11/2008 - 14:49

This is great to see, really exciting, Sebastian. But what would put the cherry on top of your blogs are LOADS of hyperlinks in them to definitions of terms and tutorial articles. I just don't know what an RDF resource is, I have a vague idea what an ontology is from devouring all the aKademy presentations and dot articles I can find, and I've no clue what the nao: namespace? is. The Semantic Desktop is a meeting of KDE and a large established semantics community, and the learning curve for pure KDE hackers is steep. To establish a robust and active community of users and people who develop amazing new stuff, not just a few curious 'toe-dippers', it would help if you could smooth that curve out for us and provide easy paths to expand our minds so we can perceive all the possibilities of this exciting new world.

By Will Stephenson at Mon, 02/11/2008 - 15:39

You are right. Stay tuned for some more details tomorrow. :)

By Sebastian Trüg at Mon, 02/11/2008 - 22:58

/me already waiting.. =) ;)

By mxttie at Tue, 02/12/2008 - 10:26


I find that Nepomuk could easily also index "applications" as resources, besides just files. Indexing applications means finding and associating configuration backends (KConfig XT, gconf, dconf, plain-text .properties files...) for different applications. My hope is that Nepomuk could automatically find Thunderbird, Apache and Dolphin applications and settings and offer them appropriately. That means: when asked for mail clients it should return KMail, Thunderbird, Mutt and every other mail client I may have installed on the machine.

And mail configuration made in Thunderbird files should be accessible to KMail through Nepomuk.

My question is this:
Is anything similar to this on the schedule? Is this a sound idea, or is it total baloney and undoable with Nepomuk?

Thanks for spending time to read this,

By bogdanbiv at Wed, 04/23/2008 - 17:53