Fun with Microsoft OOXML

Monday, 10 March 2008  |  Dipesh

It is one thing to wade through 7000 pages of rather technical documentation and try to extract something useful from it for a concrete question. It is another thing to look at the actual XML produced by the Microsoft Office 2007 suite.

There we have the workbook as the main entry point for spreadsheets, that is, for what MS Excel writes out. Such a workbook normally contains general information about the sheets, the file revision and so on. Now we know that something that may become an ISO standard is at least vendor- and application-independent, so that it can be used by more than those who pushed for that standard, right? And hey, it's all XML (except things like the binary printer driver embedded in each OOX document). So, it's open, right?

Within this main entry point I ran into an XML tag that looks like this:

That's one of the very first tags someone has to deal with if they want to do something with that format. So, two attributes that sound mysterious. But hey, that's what the 7000 pages of specs are there for, right?
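To make this concrete, here is a minimal Python sketch of how a reader might pull those two attributes out of the workbook part. It assumes the tag in question is the workbookPr element of SpreadsheetML. The sample document and both attribute values are made up for illustration and are not copied from a real file.

```python
import xml.etree.ElementTree as ET

# Main SpreadsheetML namespace used by the workbook part.
NS = {"s": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}

# Hypothetical sample; the attribute values are illustrative assumptions.
SAMPLE = (
    '<workbook xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">'
    '<workbookPr filterPrivacy="1" defaultThemeVersion="124226"/>'
    "</workbook>"
)


def read_workbook_pr(xml_text):
    """Return the raw attribute strings of workbookPr, or {} if it is absent."""
    root = ET.fromstring(xml_text)
    elem = root.find("s:workbookPr", NS)
    return dict(elem.attrib) if elem is not None else {}


attrs = read_workbook_pr(SAMPLE)
print(attrs.get("filterPrivacy"), attrs.get("defaultThemeVersion"))
```

Reading the strings is the easy part. What an application is supposed to do with them is the question the rest of this post is about.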

So, on page 1919 of the "Office Open XML Part 4" PDF document, which is around 40MB in size and freezes my rather new AMD64 dual-core with 2GB of RAM for several minutes, is the description of both attributes.

The section about "filterPrivacy" says: Specifies a boolean value that indicates whether the application has inspected the workbook for personally identifying information (PII). If this flag is set, the application warns the user any time the user performs an action that will insert PII into the document. For example, inserting a comment might insert the user's name.

Oha. So it's another boolean flag, and it describes what the application should do during editing (hint: this is a file format and not a guide on how to implement the application itself). To be able to load and save that flag and that PII thing, I would need to know more details about what exactly PII is, where it's stored and how I can load it. But nowhere in the 7000 pages are there any details about this :-( Fine, only Microsoft knows...

Okeli, let's give up on this one. Well, being able to load 50% (so 1 of 2 attributes) should be enough, and it's still XML and open, right?

The section about "defaultThemeVersion" says: Specifies the default version of themes to apply in the workbook. The value for defaultThemeVersion depends on the application. SpreadsheetML defaults to the form [version][build], where [version] refers to the version of the application, and [build] refers to the build of the application when the themes in the user interface changed. The possible values for this attribute are defined by the XML Schema unsignedInt datatype.
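A small Python sketch shows why the [version][build] form is hard to work with for anyone outside the producing application. The quoted text does not say where [version] ends and [build] begins, so every cut of the digit string is a candidate. The value 124226 is a hypothetical sample, and the helper name is invented for this illustration.

```python
def candidate_splits(value):
    """List every way to cut a digit string into ([version], [build])."""
    if not (value.isascii() and value.isdigit()):
        raise ValueError("expected decimal digits only")
    # Cut after each position, leaving at least one digit on both sides.
    return [(value[:i], value[i:]) for i in range(1, len(value))]


# Hypothetical sample value; nothing in the quoted text picks one split.
for version, build in candidate_splits("124226"):
    print(f"version={version} build={build}")
```

All five splits are valid readings of the quoted sentence. Picking the right one requires knowledge about the application that the text itself does not provide.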

Oha again. Not only do I wonder about the unsignedInt datatype (in fact I didn't know before that XML is so C/C++-like), but did I get it right that this potential ISO standard contains details about "MSOffice Themes"? Wow, now that's really vendor-neutral, and no ISO standard should come without this! What a great idea to just append all of /etc, including the man pages, to something like the OpenDocument specs. Man, we could blow up that documentation by a factor of 10 at least, and everybody would waste their time sorting those things out too!

I don't get why such application-dependent details are all over the place in the MSOOXML specs. Would it be so difficult to at least separate them from the really useful things someone is able to implement? I mean, why should an ISO standard contain such totally unimportant details that only one vendor is able to implement?

Please, ISO, don't push such trash on us. Everybody who's able to read those specs will see that they are just not ready yet. Do yourself and others a favor and abort the fast-track process. Let those specs go the regular way, the way OpenDocument went too. That really helps to improve the quality.

Updated: The Internet continues to be an impressive medium. Someone pointed me to the Wikipedia article about PII. Now I am one step further, though not in the direction of a solution, since I still need to figure out what it may mean in the context of this specification and, most importantly, how to implement it in the most compatible way. Also interesting is that unsignedInt as well as boolean are part of XML Schema. And again something learned :)
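Since both types come from XML Schema, at least the lexical side can be handled without guessing. Below is a simplified Python sketch: xsd:boolean allows the four literals true, false, 1 and 0, and xsd:unsignedInt covers 0 to 4294967295. The function names are my own, and the unsignedInt parser deliberately accepts only plain decimal digits, which is a narrower reading than a full schema validator would apply.

```python
UNSIGNED_INT_MAX = 2**32 - 1  # 4294967295, upper bound of xsd:unsignedInt


def parse_xsd_boolean(text):
    """Map the four xsd:boolean literals onto a Python bool."""
    value = text.strip()
    if value in ("true", "1"):
        return True
    if value in ("false", "0"):
        return False
    raise ValueError(f"not an xsd:boolean: {text!r}")


def parse_xsd_unsigned_int(text):
    """Parse plain decimal digits and enforce the unsignedInt range."""
    value = text.strip()
    if not (value.isascii() and value.isdigit()):
        raise ValueError(f"not an xsd:unsignedInt: {text!r}")
    number = int(value)
    if number > UNSIGNED_INT_MAX:
        raise ValueError(f"out of range for xsd:unsignedInt: {text!r}")
    return number


print(parse_xsd_boolean("1"), parse_xsd_unsigned_int("124226"))
```

This only settles how the values are written. What filterPrivacy obliges an application to do is still open.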

Update 2008/07/07: And as it turned out, the ISO was, according to Wikileaks, aware that the fast track was "the wrong thing". Anyway, I trust them that this will have no consequences :-/