Researching the state of PDF manipulation tools in the world of Free Software (1)

Friday, 26 January 2007 | Pipitas

Readers of my blog will know it already: Linux printing is set to move towards PDF as its core spooling and job processing format. (This won't happen overnight, and it won't make PostScript printing any harder, so don't worry.) That was the overall consensus at last year's Linux Desktop Printing Summit in Atlanta, where developers from CUPS, Linuxprinting.org, FreeStandards.org, Freedesktop.org, OpenPrinting.org, OpenUsability.org, Ghostscript, Scribus, KDE, Gnome, Red Hat, SUSE, Ricoh, Lanier, HP, Xerox, IBM, Mandriva, Debian, Mozilla and Sun sat together for 3 days, exchanged ideas and discussed how to move forward.

PDF is in some respects a direct descendant of PostScript anyway. The format was developed by the same company, Adobe, and it is based on the same graphics and imaging model as PostScript. PDF, though, has been stripped of the features that make PostScript a fully-fledged programming language.

On the other hand, PDF's handling of advanced graphic objects, fonts, colors, layers and transparency has been fine-tuned considerably over time.

The internals of a PDF file are quite complicated. The current PDF specification document runs to 1200 pages (...of PDF, what else?). A PDF is not something that you can simply manipulate at will with a text editor, however much of a hacker you may be. Well, PDFs were designed to be un-editable in the first place. They are meant to pin down the page images they represent so that they print and display on screen with high fidelity across different devices, computers and operating systems.

That design goal was ... hmm, not entirely reached in practice, as every Prepress professional will tell you. PDF file processing *still* requires highly specialized knowledge, and a set of rules to be followed, in order to make the complete professional printing process chain (from the designer of a page working on a Mac, to the print engineer overseeing a high-speed digital offset press) work reliably: to let colors match exactly the shade and tone they are intended to match, and to let the fonts look like they should.

So, in practice, the Prepress and DTP people in the industry *do* have an assortment of highly specialized tools that *can* lift the restrictions. They routinely open and manipulate PDF files to repair things that may prevent them from printing exactly as needed: exchange fonts, remove shapes from individual pages, remove layers, correct typos and whatnot.

Anyway, as I already said above: PDF file internals are not straightforward. They are in no way like ASCII text files (rather "flat"), or like XML files (more like "trees") -- they are organized in various elements that reference each other, and they contain "streams" as specific parts which may describe the various graphic objects represented in the file. Even a "simple" PDF viewer is not easy to create. Let alone a tool to manipulate a PDF without damaging its integrity....

Now, we don't have many (or even any) Free+free tools for that task yet, have we? The utilities to access a PDF in the way described above (an operation on its pumping heart, so to speak) are by and large only available for Mac OS X and MS Windows -- and they are rather expensive. We, in the FOSS world, can extract pages from a PDF, yes. Ghostscript can convert PDFs into different formats. pdftk can do quite a few things when it comes to merging PDFs and adding watermarks to their pages. But that's it. No changing of strings. No changing of fonts. For users, no handling of layers. No scaling of individual objects on an arbitrary page. No moving of pictures on a page from one place to the next. No rotating of text boxes. No filling in of forms. No digital signatures for document exchange. (I'm mixing a few different requirements here, and I'm neglecting the rudimentary beginnings of some developments as well.)

No easy-to-use toolkit for developers either that allows the creation of high-quality PDF output...

I was not able to follow FOSS developments closely in the second half of last year, so I may have missed a lot of announcements and initiatives. So I decided to turn to that little "gg:pdf manipulation" trick of Konqui and find out. And boy, was I surprised.

In the end, two hits came up that look extremely promising. I'll describe them in my next two blog posts. (I need time to go through my notes and do a proper writeup first, so stay tuned.)