October 29, 2015
Last week I participated in the annual PDF Technical Conference in San Jose, California. I gave a talk about EPUB, but my bigger interest was to check in on the state of the PDF ecosystem. Clearly, PDF still holds a dominant position in electronic documents, which continue to be critical to communicating information. But, as I see it, thanks to fragmentation of solutions and lack of take-up of advanced features, the PDF ecosystem is more and more focused on PDF’s roots in representing final-form pages, rather than on addressing new requirements for interactive, mobile-ready, accessible content. This has major implications for the future of PDF, and for the future of EPUB and Web documents.
PDF… It’s Still A Thing… a BIG Thing
PDF is now over 20 years old, and may seem like yesterday’s news to folks who publish everything online in HTML via CMSs or wikis, or who distribute all their commercial content in EPUB and haven’t thought of PDF as an eBook format for years. And anyway the “action” in ICT is in mobile apps, video and Big Data – certainly not documents.
But, as the conference keynotes emphasized, the vast majority of the content that enterprises and other organizations generate and distribute on a daily basis is still documents, and PDF is still the easiest and most popular way to distribute documents that can be consumed everywhere. Not only is there more content on the Web and on corporate servers in PDF than in any other format – billions and billions of documents – but online publishing in PDF is still growing. Every major desktop and mobile operating system has PDF viewing built-in, and PDF is second only to HTML as a document format on the Web. PDF is the primary format for account and transaction statements, for contracts and other documents of record, for white papers and reports, scientific articles, presentations… the list goes on. PDF is of course still the dominant format for workflows that end in printed output – from prepress professionals to consumers and everyone in between – and, going the other way, for moving workflows that are still paper-based into digital, something that’s still very much a work-in-progress for many enterprises and government agencies around the world.
So while we may take PDF for granted, the format remains a foundational building-block of modern ICT. It’s become part of the background, but it definitely hasn’t faded away.
And, while this year’s conference was held at Adobe’s global headquarters in Silicon Valley, PDF has metaphorically left the building: it’s no longer fair to think of it as Adobe’s proprietary format. Adobe Acrobat is no longer the primary creator of PDFs: Microsoft Office and many other programs can directly export PDFs, as can operating system print-drivers. And, even more critically, for many people and on many platforms Adobe Reader is no longer the only or even primary PDF viewing application. Even Adobe’s own mobile Reader has a totally different codebase than Adobe’s desktop Reader. So PDF is no longer just about Adobe Acrobat generating content for Adobe Reader to view. And, the upcoming version 2 ISO Standard has truly been collaboratively developed – the specifications are no longer a rubber-stamp of Adobe’s latest Acrobat features (as was the case with the original ISO 32000 eight years ago).
PDF Moving Forwards … Or Backwards?
But while this broad industry support has helped proliferate PDF into a foundational building-block of document communication, something everyone from knowledge workers to consumers can take for granted, the resulting fragmentation of PDF creation and viewing solutions has created a serious “least common denominator” problem. Content authors who use features only supported in Adobe’s desktop Reader risk breaking the brand promise of “view and print anywhere”. So as a result they don't use leading-edge features. And as a result of that, there’s not much pressure on PDF viewing apps to add support for them. This chicken-and-egg problem, combined with the reality that what people want to do with PDFs most of the time (view and print) has been available from day one, has kept the de facto interoperable PDF format stuck at a pretty low level of functionality. The lack of formal validation and of reading system conformance tests for PDF has made the murky definition of de facto interoperability even murkier: because PDF is built on its own specialized binary format, it has no formalized schema and can’t leverage the validation tools available for standard building-block formats like ZIP and XML.
This was mentioned many times during the conference and can be seen most starkly in the two biggest pieces of news coming out of the event:
- The most substantive change in Version 2 of the ISO Standard for PDF is deprecation of the XML Forms Architecture that Adobe introduced way back in 2003 (“XFA” - and it's telling that Adobe's website still refers to it as "Adobe XML forms"). This will leave only a legacy and rather primitive fillable-forms technology (“AcroForms”) as a first-class part of the PDF specification (and AcroForms is not universally or consistently supported across all PDF viewers).
- The next subset profile of PDF was announced at the conference, and it will be the most minimalistic ever: PDF/R, aka PDF/Raster, which, exactly as it sounds, strips down a PDF to just a set of image files plus metadata, designed to be generated directly by scanners.
There was also quite a bit of angst expressed about PDF/UA, the accessibility-oriented profile of PDF. While it looks like Adobe’s clout will likely succeed in getting the upcoming refresh of Section 508 compliance to OK use of PDF/UA, various conference speakers expressed concern because almost no PDF files are compatible with it: most PDF creators don’t generate, and most PDF viewers (including Adobe’s own mobile Readers) don’t utilize, the structural information it’s based on (“Tagged PDF”). The most generous estimates were that only 10% to 15% of PDF files have any structural tagging, and the lead developer of Apache PDFBox admitted “I very rarely encounter Tagged PDF”. The upcoming PDF Version 2.0 makes major changes to Tagged PDF to make content theoretically more accessible… but since the original Tagged PDF is poorly supported, there’s no guarantee that the new flavor will gain adoption, and it might even confuse things further.
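To make the earlier point about validation tooling a little more concrete, here is a rough sketch of how cheaply a format built from ZIP and XML (EPUB being the obvious example) can be sanity-checked using nothing but a standard library. The function names and the tiny sample package are my own inventions for illustration, and this only tests archive integrity and XML well-formedness; it is not schema validation and certainly not a full conformance checker.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def check_zip_xml_package(data: bytes) -> list:
    """Return a list of problems found in a ZIP-plus-XML package (empty if none).

    Hypothetical helper: checks archive integrity and XML well-formedness only.
    """
    try:
        archive = zipfile.ZipFile(io.BytesIO(data))
    except zipfile.BadZipFile as exc:
        return [f"not a readable ZIP archive: {exc}"]

    problems = []
    with archive:
        # testzip() reads every member and returns the first one with a bad CRC.
        bad_member = archive.testzip()
        if bad_member is not None:
            problems.append(f"corrupt member: {bad_member}")
        for name in archive.namelist():
            # Only plain XML members are parsed here; XHTML content documents
            # may rely on DTD entities that this simple parser would reject.
            if not name.endswith((".xml", ".opf")):
                continue
            try:
                ET.fromstring(archive.read(name))
            except zipfile.BadZipFile as exc:
                problems.append(f"{name}: unreadable ({exc})")
            except ET.ParseError as exc:
                problems.append(f"{name}: not well-formed XML ({exc})")
    return problems


def build_sample_package(broken: bool = False) -> bytes:
    """Build a tiny in-memory package for demonstration (not a valid EPUB)."""
    body = "<container><rootfiles/></container>"
    if broken:
        body = "<container><rootfiles></container>"  # mismatched tag
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", "application/epub+zip")
        archive.writestr("META-INF/container.xml", body)
    return buffer.getvalue()


if __name__ == "__main__":
    print(check_zip_xml_package(build_sample_package()))
    print(check_zip_xml_package(build_sample_package(broken=True)))
```

The point is less the code itself than the fact that everything it relies on is generic, widely available tooling; doing the equivalent for a binary PDF means writing or licensing a PDF-specific parser first.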
PDF: Billions of Documents, Handfuls of Experts
Another key point about the PDF ecosystem is that it’s very small and, well, quite mature. Last week’s event, the largest annual conference for the PDF community, had well under a hundred attendees. Every speaker on the program seemed to be a vendor or solution provider of some kind… one after another proclaiming that they had decades of personal experience working with PDF. There was not only a shortage of fresh blood, there was a shortage of actual users, not only on the program but even in the audience.
PDF and the Future
My perspective is surely colored by both my past – having worked on establishing PDF as the ePaper standard – and my present – working on establishing EPUB as the next-generation portable document format for the Open Web Platform. But as I see it, PDF tried and failed to evolve past its roots in the final-form page… and its ecosystem is now re-focusing on that core value proposition. That's probably a good thing, as attempting to graft logical structure onto content that's already been “typeset at the factory” was always a questionable proposition. However, the increasing demand for accessible, mobile-ready, machine-processable content means that PDF may less and less often be seen as the right solution for digital-native documents (i.e. content that didn’t come out of a scanner, and is probably going to be read and stored digitally rather than being routinely printed).
John Warnock’s original Camelot paper in 1991 inspired me personally, and a highlight of my career was getting the chance to help to make his vision a reality. But the Camelot paper also contains the seed of PDF’s limitation – the problem he was concerned with was “the ability to communicate visual material between different applications and systems” (emphasis added). The goal he set out was to supersede the FAX machine – and we succeeded, with a success arguably beyond anyone’s expectations (except maybe John’s). But documents aren’t just (or even mainly) visual material. The Camelot paper is, ironically, a case in point: it's just paragraphs of plain text in a default font, yet as a PDF it's inaccessible, and impossible to read even on a large-screen smartphone. Google’s mission to “organize all the world’s information and make it universally accessible and useful” is a part of the mission of many enterprises and organizations. But images of pages simply aren’t as accessible, and aren’t as useful, as more structured representations of content.
Of course the FAX machine is still with us and so will be PDF, for a long, long time to come. But I believe the next phase of portable documents for the Web will need to take us well beyond the print-replica PDF. The good news is that many in the PDF community realize that the original binary PDF format isn’t the be-all and end-all, and it seems a number of folks in that community share a bigger vision for a broader portable document platform that can embrace multiple serializations in service of the bigger goal of making information more accessible and useful. We're already seeing some interesting crossovers, from services to automatically convert PDF to EPUB to the first EPUB print-on-demand solutions. Despite some trepidation on my part, my talk on EPUB at the PDF conference was well-attended and seemed to be well-received… anyway, no vegetables were thrown in my direction. So far, anyway.