Portable Meta-Information continued

Tuesday, 31 March 2009 | Oever

In a recent blog post, David Nolden talks about transferring user-generated, file-associated meta-data. His post was well written and the ensuing discussion interesting. I'd like to continue his line of thought here.

Anything the user creates on his machine should be easy to archive, synchronize, share, query and protect. Most data these days is stored as separate files. The idea of the semantic desktop is to make content depend less on files. The file is just the container of the content. The content consists of various concepts and relations between the concepts. Also, there are bigger data blobs, usually multimedia files, but also industrial data types like measurement results, logging files and databases.

As long as nuggets of data are separate entities, this is fine. But what do you do with small annotations to files such as ratings, tags, or source logs? The Nepomuk solution is to store this data in a central database. David promotes the use of companion files to the content files. This type of file is called a sidecar file. To speed up meta-data querying, a central index is required.
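As a rough illustration of the sidecar idea, the sketch below stores annotations in a JSON file next to the content file and rebuilds a tag index by scanning a directory tree. The `.meta.json` suffix, the `tags` key and all function names are assumptions made for this example; they are not taken from David's proposal or from Nepomuk.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical naming convention: "track.ogg" gets "track.ogg.meta.json".
SIDECAR_SUFFIX = ".meta.json"


def sidecar_path(content: Path) -> Path:
    """Return the path of the sidecar file that accompanies a content file."""
    return content.with_name(content.name + SIDECAR_SUFFIX)


def write_annotations(content: Path, annotations: dict) -> None:
    """Store annotations (ratings, tags, ...) next to the content file."""
    sidecar_path(content).write_text(json.dumps(annotations, indent=2), encoding="utf-8")


def read_annotations(content: Path) -> dict:
    """Read the annotations of a content file; empty if there is no sidecar."""
    path = sidecar_path(content)
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def build_tag_index(root: Path) -> dict:
    """Scan all sidecar files below root and map each tag to content file names."""
    index = {}
    for sidecar in root.rglob("*" + SIDECAR_SUFFIX):
        content_name = sidecar.name[: -len(SIDECAR_SUFFIX)]
        data = json.loads(sidecar.read_text(encoding="utf-8"))
        for tag in data.get("tags", []):
            index.setdefault(tag, set()).add(content_name)
    return index


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        track = Path(tmp) / "track.ogg"
        track.write_bytes(b"")  # stand-in for real audio data
        write_annotations(track, {"rating": 4, "tags": ["live", "jazz"]})
        print(read_annotations(track))
        print(build_tag_index(Path(tmp)))
```

The index here is rebuilt from scratch on every call. A real central index would have to be kept up to date as files change, which is exactly the part that sidecar files do not provide by themselves.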

This is an interesting approach even though it is not a complete solution. I'd like to emphasize that the requirements for meta-data are the same as those for normal data. Both are data and both are valuable.

Let us look at an example of extensive meta-data use. Let us set aside for a moment the details of how to store and manipulate data and simply imagine we have only data and storage media. On storage A, there is an audio track. The audio track was recorded at a certain time and place by a certain person. The recording captures sounds created by other people. The owner of storage A, Mrs X, records her opinion about the track in a couple of paragraphs of English prose and as a collection of grades for some aspects of the track.

Mr Y is permitted to borrow storage A from Mrs X. Mr Y records his opinion of the track and his listening behavior on his personal storage disk, B. He shares his opinion of the track with Mrs X, but keeps the information about where and when he listened to the track private. This follows the policy he has laid out on his personal devices with respect to audio tracks lent to him. His audio player enforces this policy for him. Mr Y also makes a link between the track of Mrs X and an audio track he once borrowed from Mr Z. In the link he stores a comment about certain similarities between the tracks.

In the meantime, Mrs X keeps listening to the track on her second mobile storage C. When Mr Y returns storage A, Mrs X synchronizes it with C and makes a backup of C to her media tank, D.

There are many technical issues that must be solved to enable a relatively simple scenario like the above.

X and Y must both be able to handle the audio format in which the track is recorded
X and Y must use a common ontology for the prose review, grade points and gradable categories
X and Y must both use software that understands the way in which the data and meta-data are stored on storage A
Y must be able to store meta-data about data that is no longer in his possession
Y must be able to define a filter that influences the way meta-data is synchronized across different data storages
Y must be able to store relations between two or more data items
X must be able to synchronize and merge different meta-data items with meta-data specific rules
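Two of these requirements, the synchronization filter and the merge with meta-data specific rules, can be sketched in a few lines. The keys (`review`, `grades`, `tags`, `last_played`), the sharing policy and the merge rules below are hypothetical and chosen only to mirror the scenario; a real system would take them from a shared ontology.

```python
from typing import Any, Callable, Dict, Set

Metadata = Dict[str, Any]

# Hypothetical policy of Mr Y: only these keys may leave his devices.
SHARED_KEYS: Set[str] = {"review", "grades"}


def apply_share_policy(metadata: Metadata, shared_keys: Set[str] = SHARED_KEYS) -> Metadata:
    """Keep only the meta-data items that the policy allows to be shared."""
    return {key: value for key, value in metadata.items() if key in shared_keys}


# Hypothetical per-key merge rules: each rule combines the two existing values.
MERGE_RULES: Dict[str, Callable[[Any, Any], Any]] = {
    "tags": lambda a, b: sorted(set(a) | set(b)),  # union of both tag sets
    "last_played": max,                            # ISO dates compare as strings
    "grades": lambda a, b: {**a, **b},             # per-aspect grades, right side wins
}


def merge(left: Metadata, right: Metadata,
          rules: Dict[str, Callable[[Any, Any], Any]] = MERGE_RULES) -> Metadata:
    """Merge two meta-data records, using a key-specific rule where one exists."""
    merged = dict(left)
    for key, value in right.items():
        if key not in merged:
            merged[key] = value
        elif key in rules:
            merged[key] = rules[key](merged[key], value)
        else:
            merged[key] = value  # default when no rule is known: right side wins
    return merged


if __name__ == "__main__":
    on_b = {"review": "Warm sound", "grades": {"vocals": 8},
            "listened_at": "home", "listened_on": "2009-03-20"}
    print(apply_share_policy(on_b))  # what Mr Y passes on to Mrs X

    on_a = {"tags": ["live"], "last_played": "2009-03-01"}
    on_c = {"tags": ["jazz", "live"], "last_played": "2009-03-15"}
    print(merge(on_a, on_c))  # what ends up on C after synchronization
```

The "right side wins" default is a simplification. Deciding what to do with conflicting values for which no rule exists is one of the harder parts of the problem.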

Using sidecar files does not solve any of the above issues on its own. This does not mean that it is a bad idea. It just means that an awful lot of functionality is still required in addition. Sidecar files allow programs that know nothing of meta-data or ontologies to keep that data intact when directories are copied. But that is not nearly enough for handling meta-data. Hence sidecar files will always be an incomplete solution.

Another interesting and also incomplete solution is a proposal on live.gnome.org. It is recommended reading.

It is clear that we need a way of handling data that goes beyond what the current file systems can do. What alternatives are there? We could keep all data in a triple store which can also be exposed as and synchronized with traditional file systems and might even be a traditional file system underneath.
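To make the triple store idea a little more concrete, here is a minimal in-memory sketch. It only shows how ratings and links between tracks can live in one uniform structure of subject, predicate and object; the class, the `track:` identifiers and the `ex:` predicates are invented for the example and do not correspond to any existing store or ontology. Persistence, synchronization and the mapping to a file system are left out entirely.

```python
from typing import Iterator, NamedTuple, Optional, Set


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


class TripleStore:
    """A toy triple store: a set of (subject, predicate, object) statements."""

    def __init__(self) -> None:
        self._triples: Set[Triple] = set()

    def add(self, subject: str, predicate: str, obj: str) -> None:
        self._triples.add(Triple(subject, predicate, obj))

    def match(self, subject: Optional[str] = None, predicate: Optional[str] = None,
              obj: Optional[str] = None) -> Iterator[Triple]:
        """Yield all triples that agree with the given fields; None is a wildcard."""
        for triple in self._triples:
            if subject is not None and triple.subject != subject:
                continue
            if predicate is not None and triple.predicate != predicate:
                continue
            if obj is not None and triple.obj != obj:
                continue
            yield triple


if __name__ == "__main__":
    store = TripleStore()
    # Hypothetical identifiers and predicates, loosely following the scenario.
    store.add("track:x", "ex:rating", "4")
    store.add("track:z", "ex:rating", "5")
    store.add("track:x", "ex:similarTo", "track:z")

    for triple in store.match(subject="track:x"):
        print(triple)
```

A relation such as the link Mr Y makes between two tracks is just another triple here, which is something a sidecar file tied to a single content file cannot express as naturally.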