Is PDF Actually That Complicated?

Illustration by Geekswipe, CC BY 4.0

We’ve cursed at it. We’ve tried to scrape data from it. We’ve even wondered why a seemingly simple document format requires a 1,000-page ISO specification.

The Portable Document Format (PDF) is the cockroach of the digital world. It has survived the dot-com bubble, the mobile revolution, the shift to cloud computing, and now the AI era. It’s ubiquitous.

We tolerate PDF, until we’re tasked with extracting text from it. That’s when we quickly realize it’s less of a document and more of a Jurassic Frankenstein of a monster. Yep! Today at work, I had the unfortunate experience of extracting data from a really old PDF!

I’ll write an in-depth explainer some day, but here’s the overview of how it all works compared to a regular document.

How does PDF work?

When you open a Word document or a text file, you are looking at a stream of characters. The software flows that text onto the screen based on the window size, the font settings, and the margins. It’s fluid.

But PDF? It’s not fluid. PDF doesn’t care about the screen size. PDF is obsessed with one thing and one thing only: absolute visual fidelity.

Under the hood, PDF has no paragraphs or sentences. It’s all commands. You can think of a PDF page as a small program written in a specialized drawing language descended from PostScript, only without PostScript’s loops and conditionals.

Instead of saying “Here is a paragraph of text,” a PDF issues a sequence of instructions like these:

  1. Move the cursor to X: 120, Y: 450.
  2. Load the font ‘Helvetica-Bold’.
  3. Set the font size to 12pt.
  4. Draw the glyph for ‘H’.
  5. Move the cursor 5 points to the right.
  6. Draw the glyph for ‘e’.
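
To make that concrete, here is a minimal sketch in Python of how those steps look as a page content stream and how a reader might walk through them. The stream is hand-written for illustration, `/F1` is a hypothetical resource name that the page would map to Helvetica-Bold elsewhere in the file, and `interpret` is a toy of my own, not a real library. It only understands the operators `BT`, `ET`, `Tf`, `Td`, and `Tj`, and it ignores glyph widths, transformation matrices, and everything else a real reader has to handle.

```python
import re

# Hand-written toy content stream (an assumption for illustration).
CONTENT_STREAM = """
BT
/F1 12 Tf
120 450 Td
(H) Tj
5 0 Td
(e) Tj
ET
"""

NUMBER = r"[-+]?\d*\.?\d+"
# Order matters: literal strings, then names, then numbers, then operators.
TOKEN = re.compile(
    r"\((?:\\.|[^\\()])*\)"      # literal string, e.g. (H)
    r"|/[^\s/()<>\[\]]+"         # name, e.g. /F1
    r"|" + NUMBER +              # number, e.g. 120
    r"|[A-Za-z'\"*]+"            # operator, e.g. Td
)

def interpret(stream):
    """Return a list of (x, y, font, size, text) for each text-showing operator."""
    operands = []            # operands pile up until an operator consumes them
    font, size = None, None
    x = y = 0.0              # start of the current text line
    placed = []
    for tok in TOKEN.findall(stream):
        if tok.startswith("("):
            operands.append(tok[1:-1])
        elif tok.startswith("/"):
            operands.append(tok[1:])
        elif re.fullmatch(NUMBER, tok):
            operands.append(float(tok))
        else:
            if tok == "BT":              # begin text object: reset position
                x = y = 0.0
            elif tok == "Tf":            # select font and size
                font, size = operands
            elif tok == "Td":            # move relative to the line start
                x += operands[0]
                y += operands[1]
            elif tok == "Tj":            # show a string at the current position
                placed.append((x, y, font, size, operands[0]))
            operands = []                # every operator clears its operands
    return placed

for item in interpret(CONTENT_STREAM):
    print(item)
```

Notice what comes out: two separate glyphs with coordinates, and nothing that says they belong to the same word.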

PDF doesn’t know what a word is, let alone a paragraph. It just knows where to draw glyphs on a digital page. So when we copy and paste text from a PDF reader, the reader is essentially guessing the reading order based on the physical proximity of those glyphs. And if the creator used a custom font without embedding a mapping from its glyphs to Unicode? Good luck! You’re extracting Wingdings!
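
Here is a rough sketch of that guessing game in Python. It assumes we already have each glyph as an `(x, y, width, char)` tuple, which is a made-up layout for this example. The function name `guess_text` and both thresholds are my own assumptions too. Real extractors use far more elaborate heuristics, but the idea is the same: group glyphs into lines by their baseline, sort each line left to right, and insert a space wherever the gap looks big enough.

```python
def guess_text(glyphs, line_tolerance=2.0, space_gap=1.5):
    """Guess reading order from glyph positions.

    glyphs: iterable of (x, y, width, char) tuples in page units.
    line_tolerance and space_gap are illustrative thresholds, not standard values.
    """
    # PDF's y axis points up, so a larger y means higher on the page.
    lines = []  # each entry: [baseline_y, [glyphs on that line]]
    for g in sorted(glyphs, key=lambda g: -g[1]):
        if lines and abs(lines[-1][0] - g[1]) <= line_tolerance:
            lines[-1][1].append(g)
        else:
            lines.append([g[1], [g]])

    out = []
    for _, row in lines:
        row.sort(key=lambda g: g[0])          # left to right
        text = row[0][3]
        for prev, cur in zip(row, row[1:]):
            gap = cur[0] - (prev[0] + prev[2])
            if gap > space_gap:               # a wide gap is probably a space
                text += " "
            text += cur[3]
        out.append(text)
    return "\n".join(out)


# Glyphs deliberately listed out of order, as they may be in a real file.
sample = [
    (106, 700, 3, "i"), (100, 700, 6, "H"),
    (121, 700, 6, "D"), (115, 700, 6, "P"), (127, 700, 6, "F"),
    (105, 686, 5, "k"), (100, 686, 5, "o"),
]
print(guess_text(sample))
```

Tweak `space_gap` a little and words start merging or splitting apart, which is exactly the kind of output you get from a badly behaved PDF.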

PDF is a monster

Adobe wanted PDF to do everything, and that turned it into a monster. It wanted PDF to embed fonts, images, and interactive forms, and even run JavaScript. And no, it doesn’t stop there. It also wanted audio, video, 3D models, digital signatures (okay, this one is cool), and encryption layers.

This is why parsing a PDF is not a weekend project. It feels exactly like parsing something that has accumulated 30 years of technical debt. If you want to see the madness yourself, check out the ISO 32000-1:2008 specification. It’s light reading at 756 pages.

But hands down, it does the job it was intended to do. A PDF generated today should look exactly the same 50 years from now, regardless of the operating system, the device, or the software reading it. That’s why it’s winning. You can’t say the same about HTML.

Since we’re stuck with it, we might as well learn how it works. See you soon with an in-depth explainer.

First published Jan 16, 2013.

Aeronautical engineer, product builder, developer, science fiction author, and an explorer. I'm the creator and editor of Geekswipe. I love writing about physics, aerospace, astronomy, and technology.

More by Karthikeyan KC