searching progress

Thursday, 23 November 2006 | Oever

Strigi is moving along at a nice pace. To keep you all posted, I'd like to report on the progress that has been made. Part of it has been released in 0.3.9; the rest is in SVN and will be in 0.3.10, which is not too far away.

The current development model of Strigi has much in common with the 2.6 kernel line: new features are being added whilst keeping stability but without fear of breaking APIs.

Qt4 DBus bindings

Uptake in KDE4 can only happen if it is easy. To make life really easy for developers, Strigi can be used over DBus. This means you can do searches from your favorite language. Normally, for C++ developers this still requires generating code from the DBus introspection XML. This is a bit of effort that can be avoided by using the new pregenerated code that comes with Strigi. Two classes are included: StrigiClient and StrigiAsyncClient. Using them is easy: create an instance and call the functions on it. StrigiAsyncClient has an internal queue and allows you to use signals and slots. It also allows you to remove queries from the queue if you do not need them anymore. This is very common if you make a search-as-you-type widget. In the unlikely event that Strigi has not performed the query between keystrokes, these queries can be cancelled.
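To illustrate the queue-and-cancel idea behind a search-as-you-type widget, here is a minimal sketch in Python. It does not use the real StrigiAsyncClient API: the class `PendingQueries` and the helper `type_text` are hypothetical names made up for this example, and no DBus call is made. It only shows how a query that is still waiting can be dropped when the next keystroke arrives.

```python
from collections import deque


class PendingQueries:
    """Hypothetical FIFO of queries that have not been sent to the search daemon yet."""

    def __init__(self):
        self._queue = deque()

    def add(self, query):
        """Queue a query for later execution."""
        self._queue.append(query)

    def cancel(self, query):
        """Remove a query that is still waiting; return True if it was removed."""
        try:
            self._queue.remove(query)
            return True
        except ValueError:
            return False

    def next(self):
        """Return the next query to run, or None if nothing is waiting."""
        return self._queue.popleft() if self._queue else None


def type_text(queue, text):
    """Simulate typing: each keystroke cancels the previous prefix and queues the new one."""
    previous = None
    for i in range(1, len(text) + 1):
        prefix = text[:i]
        if previous is not None:
            queue.cancel(previous)
        queue.add(prefix)
        previous = prefix
```

If the user types "kde" before any query has been run, only the query for "kde" is left in the queue; the queries for "k" and "kd" have been cancelled.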

Refactoring

The current version of Strigi is very ambitious: it extracts all the info it can, all the time. This is laudable, but not always required. A good example is the Strigi program 'deepfind'. This program works like 'find': it lists the paths of all files in a folder. Deepfind also lists the paths of the files contained in other files (and deeper). For this, the indexer code only needs the names of the embedded files; it does not need to extract the full text of each file. Using this knowledge can speed up 'deepfind' a lot.
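The idea of listing paths inside files can be sketched in a few lines of Python. This is not how Strigi is implemented (Strigi is C++ and handles many more formats); it is a toy version that only understands zip archives. The function names `deep_paths` and `make_zip` are made up for this example. Note that it only follows the archive structure and never looks at the text of the files.

```python
import io
import zipfile


def deep_paths(path, data):
    """Yield `path`, then the paths of all entries nested inside it if it is a zip archive."""
    yield path
    stream = io.BytesIO(data)
    if not zipfile.is_zipfile(stream):
        return  # a plain file: nothing deeper to list
    stream.seek(0)
    with zipfile.ZipFile(stream) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            # Recurse: an entry may itself be an archive.
            yield from deep_paths(path + "/" + info.filename, archive.read(info))


def make_zip(files):
    """Build an in-memory zip from a {name: content} dict (helper for the example)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()


inner = make_zip({"notes.txt": "hello"})
outer = make_zip({"readme.txt": "hi", "inner.zip": inner})
for p in deep_paths("outer.zip", outer):
    print(p)
```

The loop prints the outer archive, its two entries, and the file inside the nested archive, each as a slash-separated path.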

The same holds for deepgrep (the deep version of grep). Deepgrep is not just an advanced grep; it can also serve as a good fallback for searching in directories that have not been indexed. But for this it should be as fast as possible. With the refactoring that has been done, it is now possible to configure the indexer so that it only extracts the values for which deepgrep has a search constraint.

UTF8

Until recently, Strigi was not indexing non-ASCII characters properly in the CLucene database. Internally, all strings in Strigi are UTF8, but CLucene has to store them in UCS2 to be compatible with the Lucene index format. For this reason the strings must be converted before passing them on to the index. I never noticed that this was not happening properly, because I mainly use languages with a 26-letter alphabet. Migi pointed this flaw out to me and now this serious limitation has been fixed. China and Poland rejoice.
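For readers who want to see what such a conversion involves, here is a small Python sketch that decodes UTF-8 bytes into 16-bit UCS-2 code units. It is not the code used in Strigi or CLucene, and the function name `utf8_to_ucs2` is made up. UCS-2 cannot represent characters above U+FFFF; replacing them with U+FFFD is an assumption of this sketch, not a statement about what CLucene does.

```python
def utf8_to_ucs2(data):
    """Decode UTF-8 bytes and return a list of 16-bit UCS-2 code units."""
    text = data.decode("utf-8")
    units = []
    for ch in text:
        code = ord(ch)
        if code > 0xFFFF:
            # Outside the Basic Multilingual Plane: UCS-2 has no encoding for it.
            # Assumption for this sketch: substitute the replacement character.
            code = 0xFFFD
        units.append(code)
    return units
```

A Polish "ż" is two bytes in UTF-8 but a single code unit (0x017C) in UCS-2, and a Chinese "中" is three bytes in UTF-8 and again a single code unit (0x4E2D). Treating each UTF-8 byte as one character is the kind of mistake that goes unnoticed with plain ASCII text.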

deepfind and deepgrep

I have not yet announced the 0.3.9 release on my blog. It has been out for a while and is the first version of Strigi to have deepfind and deepgrep, the applications I proposed at aKademy. These programs alone justify Strigi being included in Vista.

Deepgrep is especially cool. Did you ever feel like grepping through your email attachments, your PDF files or your office documents? Now you can!

xmlindexer

Xmlindexer, like deepfind and deepgrep, is another variation on the theme of exploiting Strigi's libraries. Xmlindexer walks through a directory and outputs an XML file containing all the metadata and text it can extract from the files it encounters. This means that Strigi's powers of data extraction are now available to all applications that can parse XML, simply by calling xmlindexer and parsing the output.

Standardization

Freedesktop.org's mailing list for standardization has seen some discussion about standardizing on metadata fields and search interfaces. Nothing definite has come of it yet, but the discussion is going in the right direction. Mikkel Kamstrup Erlandsen is keeping a running summary of the results.