Skip to content

Rediscovering trie's, reducing DOM memory usage

Monday, 12 March 2007  |  koos vriezen

Some time I'm thinking about ways to get rid of the many string comparison when playing a SMIL file in KMPlayer. The flow of a movie is mainly written in tags and attribure values. These documents have hardly any textnodes, so tags+attributes are about halve of the content, remaining is then the attribute values. So sharing these strings for tags/attributes would also be a great memory saving, making the DOM grow at 50% speed.

Idea of creating a map from ids to string doesn't look the right approach because either I should ignore unknown tags or store the string with an id. Yesterday I thought about using a trie and implemented today a basic TrieString, that allows me to share all occurences of a particular tag or attributename. The class looks like: class TrieString { TrieNode * node; friend bool operator == (const TrieString & s1, const TrieString & s2); public: TrieString (); TrieString (const QString & s); TrieString (const TrieString & s); ~TrieString (); TrieString & operator = (const TrieString & s); operator QString () const; }; TrieNode is a private class that looks like struct TrieNode { TrieNode (const char * s); void unref (); char * str; unsigned short length; unsigned short ref_count; TrieNode * parent; TrieNode * first_child; TrieNode * next_sibling; }; And one global root trie node. TrieString is only for storage and comparison. When some fancy string operation is needed the QString operator can be used. The == operator is inline bool operator == (const TrieString & s1, const TrieString & s2) { return s1.node == s2.node; } Which is really fast :-).

The trie is based on a compact trie, where the node in the above class is the last node from the string (which makes the QString construction indirect but straightforward, simply recurse to the root, calculate length while going up, allocate the buffer on the top and strcat the chunks while unwinding the stack). A small test program, int main (int, char **) { TrieString s1; TrieString s1_1(QString ("region")); s1 = s1_1; TrieString s2 (QString ("regionName")); TrieString s3 (QString ("regPoint")); TrieString s4 (QString ("regAlign")); TrieString s5 (QString ("fill")); TrieString s6 (QString ("freeze")); TrieString s7 (QString ("fit")); TrieString s8 (QString ("fontPtSize")); TrieString s9 (QString ("fontSize")); TrieString s10 (QString ("fontFace")); TrieString s11 (QString ("fontColor")); TrieString s12 (QString ("hAlign")); TrieString s13 (QString ("region")); TrieString s14 (QString ("ref")); TrieString s15 (QString ("head")); dump (root_trie, 0); return 0; } outputs: (null) len: 0 rc: 1 ..f len: 1 rc: 0 ....i len: 1 rc: 0 ......ll len: 2 rc: 1 ......t len: 1 rc: 1 ....ont len: 3 rc: 0 ......Color len: 5 rc: 1 ......Face len: 4 rc: 1 ......PtSize len: 6 rc: 1 ......Size len: 4 rc: 1 ....reeze len: 5 rc: 1 ..h len: 1 rc: 0 ....Align len: 5 rc: 1 ....ead len: 3 rc: 1 ..re len: 2 rc: 0 ....f len: 1 rc: 1 ....g len: 1 rc: 0 ......Align len: 5 rc: 1 ......Point len: 5 rc: 1 ......ion len: 3 rc: 3 ........Name len: 4 rc: 1 rc stands for refcount, so the trie can shrink as well (not implemented yet, but ..). The only downsize I see is TrieString construction, in a worsecase, where each letter gets a node, can take long. But first see if this is actually the case, and maybe add some ad-hoc first level binairy search or whatever. This would also be a candidate for the parser when reading a tag or attribute name, because a += operator is cheap (ie. we already have the end node, and when a node has refcount of 1 and no children, than simply append to the string, otherwise add or find a child node for appending. But more likely the TrieString will grow to one already in there. So this eliminates the first time construction penalty. What remains are the construction where the comparison takes place. These should be either cached in static member vars or static vars in functions. Thoughts are to have eg. a compiled in trie w/ most common strings for a start w/ a ref of 1 that never gets zero. Preventing the trie empties after each document load.

I guess no rocket sience, but kept me doing naughty things this afternoon :-)