Context Followup

Thursday, 14 April 2005  |  scott wheeler

I'm too lazy to register to post comments, so I'll just go through Derek's post more or less point by point.

This isn't Tenor; I haven't looked at the code. Search and pattern matching is a fascinating intellectual exercise, and here is the product of my feeble ruminations.

First some definitions. A token is a word, URL, filename, MIME type, time, or any of the other metadata that you would like to know. A document is a collection of tokens, most obviously an email, a web page, a PDF, a file that you composed, an appointment, and so on. The list is large and a field for much thought.

In the Tenor framework those are called nodes (or, in the explanatory notes, resources); tokens have a slightly different meaning there.

Nodes are connected via links. Both may have arbitrary sets of (namespaced) properties, and those properties provide domain-specific information about each.
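
This isn't Tenor code, but to make the shape of that concrete, here's a minimal sketch in Python of what a node/link model with namespaced properties could look like. The class names, property keys, and example values are all invented for illustration.

    # Illustrative only -- not the actual Tenor API; names and values are invented.

    class Node:
        """A resource (file, e-mail, appointment, ...) with namespaced properties."""
        def __init__(self):
            self.properties = {}   # keyed by (namespace, name), e.g. ("file", "path")
            self.links = []        # outgoing links to other nodes

    class Link:
        """A connection between two nodes; links carry properties too."""
        def __init__(self, source, target):
            self.source = source
            self.target = target
            self.properties = {}

    def link(source, target, **props):
        l = Link(source, target)
        l.properties.update(props)
        source.links.append(l)
        return l

    # Example: an e-mail node linked to a file node for its attachment.
    mail = Node()
    mail.properties[("mime", "type")] = "message/rfc822"
    attachment = Node()
    attachment.properties[("file", "path")] = "/home/scott/report.pdf"
    link(mail, attachment, relation="attachment")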

So you have a large collection of documents on your hard drive, coming in and going out over network interfaces. The first step is to build an index, token -> document. A token query would return a list of documents. Here some magic to deal with verb tenses, plurals, and possibly synonyms would be helpful. In the end, you would provide a query with a token or list of tokens and get back a list of documents.

To find the exact document would require replicating it in the query.

No, it doesn't. Indexing algorithms are actually very similar to compression algorithms. Even with a very simple scheme you build a vocabulary in which each word is represented by an integer and stored only once; you then store a second integer for each occurrence's position.

In practical text searching algorithms you're also able to throw out the most common words because they provide very little that's useful. In Information Retrieval circles these are called "stopwords".
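
To make those two ideas concrete, here's a toy positional inverted index with a stopword filter, sketched in Python. The stopword list is a tiny sample and the in-memory layout is purely illustrative -- it's not how a real engine (or Tenor) would store things on disk.

    # Toy positional inverted index with stopword filtering -- illustrative only.

    STOPWORDS = {"the", "a", "an", "of", "and", "to", "in"}   # tiny sample list

    vocabulary = {}   # word -> integer id (each word is stored exactly once)
    index = {}        # word id -> {document id: [positions in that document]}

    def add_document(doc_id, text):
        for position, word in enumerate(text.lower().split()):
            if word in STOPWORDS:
                continue
            word_id = vocabulary.setdefault(word, len(vocabulary))
            index.setdefault(word_id, {}).setdefault(doc_id, []).append(position)

    def query(word):
        """Return the set of documents containing the word."""
        word_id = vocabulary.get(word.lower())
        return set(index[word_id]) if word_id is not None else set()

    add_document(1, "to be or not to be")
    add_document(2, "the rest is silence")
    print(query("silence"))   # {2}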

Grammar is much more difficult and something of a Pandora's box. The most effective search systems do of course take language-specific grammar into account, but that isn't practical in an open system like KDE; there are just too many languages to cover. Fortunately most of the popular search engines know very little about grammar either, so we'll be in good company in largely ignoring it.

Tokenization, or splitting text into words, is one point where at least a minimal concern for locales is required. English is a pretty easy language for that, but it breaks down a bit in languages that don't have anything as convenient as a space between words, or where words are frequently concatenated.

For instance, German is famous for running together strings of adjectives and nouns. My favorite, encountered recently on the local transit website, is "Halbjahreskartezuzahlungsfrei". That's roughly "half-year-card-surcharge-free". The most interesting segment is probably "Karte" (card), but getting to it is a bit hard.

There are also things like Slavic and Semitic languages, which tend to combine articles and nouns; the "al-" at the beginning of Arabic nouns is just "the".
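
Here's a rough sketch, in Python, of how a locale-pluggable tokenizer might be wired up: a default splitter that's fine for space-delimited languages, plus a registry where a German compound splitter or an Arabic article stripper could later be plugged in. The function names and the registry itself are made up for illustration.

    import re

    # Default tokenizer: splits on non-word characters and lowercases,
    # which is good enough for space-delimited languages like English.
    def default_tokenizer(text):
        return [w.lower() for w in re.findall(r"\w+", text, re.UNICODE)]

    # Hypothetical per-locale registry; a German compound splitter or an
    # Arabic article stripper would register itself here.
    _tokenizers = {}

    def register_tokenizer(locale, tokenizer):
        _tokenizers[locale] = tokenizer

    def tokenize(text, locale="C"):
        return _tokenizers.get(locale, default_tokenizer)(text)

    # Falls back to the default until a real German splitter is registered.
    print(tokenize("Halbjahreskarte zuzahlungsfrei", locale="de"))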

Endings are another issue (plurals in many languages), but they're easier to deal with using what's called a prefix trie (character tree); suffix tries can also be created to match words from the back.
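
For example, a bare-bones prefix trie in Python might look like the following; a suffix trie is the same structure built from reversed words, so matching endings is just matching reversed prefixes. This is only to illustrate the data structure, not how the index would actually store it.

    # Bare-bones prefix trie (character tree) -- illustrative only.

    def trie_insert(root, word):
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True   # marks the end of a complete word

    def prefix_matches(root, prefix):
        """Return every stored word that starts with the given prefix."""
        node = root
        for ch in prefix:
            if ch not in node:
                return []
            node = node[ch]
        words, stack = [], [(node, prefix)]
        while stack:
            node, acc = stack.pop()
            for key, child in node.items():
                if key == "$":
                    words.append(acc)
                else:
                    stack.append((child, acc + key))
        return words

    root = {}
    for w in ("card", "cards", "care", "cart"):
        trie_insert(root, w)
    print(sorted(prefix_matches(root, "car")))   # ['card', 'cards', 'care', 'cart']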

At any rate, the goal is to come up with something that works as often as possible, and the tokenizer will have to be pluggable based on locale. The default should work reasonably well for many languages, but fine-tuning will take time. Fortunately it doesn't need to be perfect to be useful.

Getting to this point means using two or three decades' worth of research into document retrieval systems. Still not enough for our purposes.

Fortunately the field is pretty mature and most of the research is pretty accessible. I started kicking around these ideas about a year ago; the reason it's been pretty quiet up until now is that I spent a large portion of that time reading. I've got a handful of books and a couple dozen papers that I've been through on information retrieval, semantic webs, web search algorithms (PageRank, notably) and have also been brushing up on my graph theory. Lately I've added cognitive psychology where I'm looking at generalized stuff on how recall and recognition work in our heads as well as conceptual closure and all that.

I've talked to a lot of academics in the last year. I've got a pretty good network of them established. One of the things that I've never been told is that this is "too hard".

If you're interested in the non-contextual elements, Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto is a pretty good introduction to the stuff. I also had the chance to meet with Baeza-Yates to talk about this when I was in Chile six months ago. He's generally a nice fellow as well. :-)

Take this interlude to upgrade storage and processor capacity. You'll need it.

No, you won't. Hard drives are big these days. Really big. One of the amusing things that Matthias said to me after I talked to him the first time about this stuff was, "Our hard drives are huge these days -- hundreds of gigabytes -- and the most creative thing we've come up with to do with them is store thousands of stolen MP3s." While that's something of an oversimplification, he's dead on.

The amount of text that we have on our computers is really pretty small. Compressed, you could fit around 2 million copies of Hamlet on my hard disk (and the compression would actually get better with more text -- I've just tried it with one copy and done the math). And considering that an index takes significantly less space than the originals, we have little to worry about. Even if we at some point get into multimedia indexing (not an initial target, but who knows), things like acoustic fingerprints are pretty small.
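
For anyone who wants to redo the back-of-the-envelope math, here's roughly how it works out. Both figures below are my own rough assumptions -- a plain-text Hamlet compresses to somewhere around 75 KB with gzip, and I'm assuming a 160 GB drive -- not exact numbers.

    # Back-of-the-envelope check; both figures are rough assumptions.
    hamlet_compressed_kb = 75            # a gzip'd plain-text Hamlet, roughly
    disk_kb = 160 * 1024 * 1024          # assuming a 160 GB drive

    print(disk_kb // hamlet_compressed_kb)   # roughly 2.2 million copies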

To be continued...

Definitely. ;-) The main thing here is that this isn't something I went into completely cold; there's still a huge body of material out there that I haven't digested, but I've been working pretty hard at assimilating the most important things I can get my hands on.