MSOOXML: Why oh why?

Monday, 14 November 2011 | Dipesh

Some like to joke about OpenOffice.org coming with around 8 different string implementations, comparing that with what we have in Calligra with QString. But back when we worked in OASIS on what later became the ISO OpenDocument standard, we left such implementation details out.

A string is a string and it's all XML. At a minimum I would expect not to find in an ISO standard a complete XML data-structure that 1) is such an implementation detail and 2) defines its own complex data-structure for just one single use-case that could have been easily covered by an already used data-structure.

MSOOXML does exactly that. I was caught by surprise when I discovered the following snippet in an MSOOXML XML document:

<c:cat>
  <c:multiLvlStrRef>
    <c:f>Sheet1!$E$3:$G$4</c:f>
    <c:multiLvlStrCache>
      <c:ptCount val="3"/>
      <c:lvl>
        <c:pt idx="0"><c:v>Pass</c:v></c:pt>
        <c:pt idx="1"><c:v>Fail</c:v></c:pt>
        <c:pt idx="2"><c:v>NA</c:v></c:pt>
      </c:lvl>
      <c:lvl>
        <c:pt idx="0"><c:v>Result</c:v></c:pt>
      </c:lvl>
    </c:multiLvlStrCache>
  </c:multiLvlStrRef>
</c:cat>

Compared to the commonly used c:strRef element, the c:multiLvlStrRef element defines a multidimensional list of strings, with each c:lvl holding one level of labels. That is something that could easily be covered by just using multiple c:strRef elements. But no.

When looking at the MSOOXML specification we discover that this whole, rather complex structure is used exactly once, for one single use-case:

ECMA-376 page 4060

5.7.2.116 multiLvlStrRef (Multi Level String Reference)

 Parent Elements:
  - cat (§5.7.2.24); xVal (§5.7.2.235)

 Child Elements:
  - extLst (Chart Extensibility) §5.7.2.64
  - f (Formula) §5.7.2.65
  - multiLvlStrCache (Multi Level String Cache) §5.7.2.115

What I had to do was implement code to parse all that and do exactly what the c:strRef element does. This pushes additional logic and work onto consumers of the standard for no good reason. Worst of all, this stayed undiscovered in Calligra for a long time, so in some rather random cases we could end up completely ignoring categories in charts when importing Microsoft Office 2007/2010 documents.
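To give an idea of what that special-casing looks like for a consumer, here is a minimal sketch in Python. It is not the Calligra filter code (which is C++/Qt), and the helper names read_points and read_category_levels are made up for this illustration. It assumes the c: prefix is bound to the DrawingML chart namespace and that c:strRef keeps its points in a c:strCache child; the snippet from above is repeated with the namespace declaration added so that it parses on its own.

```python
import xml.etree.ElementTree as ET

# Namespace the "c:" prefix is bound to in chart parts.
NS = {"c": "http://schemas.openxmlformats.org/drawingml/2006/chart"}

# The snippet from the post, plus the xmlns:c declaration it needs standalone.
SNIPPET = """<c:cat xmlns:c="http://schemas.openxmlformats.org/drawingml/2006/chart">
  <c:multiLvlStrRef>
    <c:f>Sheet1!$E$3:$G$4</c:f>
    <c:multiLvlStrCache>
      <c:ptCount val="3"/>
      <c:lvl>
        <c:pt idx="0"><c:v>Pass</c:v></c:pt>
        <c:pt idx="1"><c:v>Fail</c:v></c:pt>
        <c:pt idx="2"><c:v>NA</c:v></c:pt>
      </c:lvl>
      <c:lvl>
        <c:pt idx="0"><c:v>Result</c:v></c:pt>
      </c:lvl>
    </c:multiLvlStrCache>
  </c:multiLvlStrRef>
</c:cat>"""


def read_points(parent):
    """Hypothetical helper: map idx -> text for the c:pt children of parent."""
    points = {}
    for pt in parent.findall("c:pt", NS):
        text = pt.findtext("c:v", default="", namespaces=NS)
        points[int(pt.get("idx"))] = text or ""
    return points


def read_category_levels(cat):
    """Hypothetical helper: return (formula, levels) for a c:cat element.

    Each level is a list of strings ordered by idx. A plain c:strRef is
    treated as a single level, so callers see one shape for both cases.
    """
    ref = cat.find("c:multiLvlStrRef", NS)
    if ref is not None:
        # The special case: one c:lvl per level inside c:multiLvlStrCache.
        cache = ref.find("c:multiLvlStrCache", NS)
        level_parents = cache.findall("c:lvl", NS) if cache is not None else []
    else:
        # The common case: the points sit directly inside c:strCache.
        ref = cat.find("c:strRef", NS)
        if ref is None:
            return None, []
        cache = ref.find("c:strCache", NS)
        level_parents = [cache] if cache is not None else []

    formula = ref.findtext("c:f", default=None, namespaces=NS)
    levels = []
    for parent in level_parents:
        points = read_points(parent)
        levels.append([points[i] for i in sorted(points)])
    return formula, levels


formula, levels = read_category_levels(ET.fromstring(SNIPPET))
print(formula)  # Sheet1!$E$3:$G$4
print(levels)   # [['Pass', 'Fail', 'NA'], ['Result']]
```

Both branches end up in the same formula-plus-cached-points shape, which is the whole point: the second branch exists only because the element and cache names differ.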

Why oh why is that data-structure used? Why not reuse c:strRef like ALL the other parts of that ISO standard do? Why duplicate the whole formula and cache logic? Why force consumers of that standard to special-case exactly one single case rather than unifying this and removing a completely unneeded section plus a complete type specification from the already rather large standard of >5000 pages?

I believe that proper QA on that standard, rather than pushing it through an ISO fast-track process at such amazing speed, would have cleaned up a lot of such cases. It would have decreased the burden of adoption and improved the overall quality.

Since changing such fundamental things afterwards, that is after it became an official ISO standard, is impossible within a maintenance life cycle with a patch release (something like the already published ISO OpenDocument 1.1 and 1.2 or the still to be published 1.x releases after them), this has to wait for a long time and would then introduce backwards-incompatible data-structures. It looks like we are now stuck with such mistakes forever :-(