What your PDF quietly leaks — and how to strip it in the browser

TL;DR

A PDF stores metadata in two places: a small Info dictionary (/Author, /Producer, /CreationDate…) and often a larger XMP packet that duplicates and extends it. Together they routinely expose your OS username, the exact tool and version that generated the file, and the creation/modification times. “Redacting” by drawing a black box doesn’t touch any of this — and doesn’t remove the text under the box either. To actually clean a PDF, rewrite the Info dictionary and drop the XMP stream. You can do it entirely client-side.

The problem

Open almost any PDF someone sent you and read its document properties. You’ll typically find:

/Author    jsmith
/Creator   Microsoft® Word for Microsoft 365
/Producer  Skia/PDF m120          (= printed to PDF from Chrome 120)
/CreationDate  D:20260531142233-05'00'
/ModDate       D:20260601090114-05'00'

The /Author value is usually the OS account name or the user name registered with the authoring application. /Producer fingerprints the exact software and version: “Skia/PDF m120” is Chrome’s print path, “macOS Version 14.5 (Build 23F79) Quartz PDFContext” is the macOS print dialog, and “Microsoft: Print To PDF” is the Windows virtual printer. The timestamps record when you created and last modified the file, down to the second and including your UTC offset. None of this is visible on the page; all of it ships with the bytes.
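
To make the timestamp leak concrete, here is a minimal sketch that decodes a PDF date string like the /CreationDate above. The helper name parsePdfDate is made up for illustration, and the parser only covers the common D:YYYYMMDDHHmmSS form with an optional offset; treat it as a reading aid rather than a complete implementation of the date syntax.

```javascript
// Hypothetical helper: decode a PDF date string such as "D:20260531142233-05'00'".
// Returns the instant as a Date plus the stated UTC offset in minutes,
// or null if the string does not look like a PDF date.
function parsePdfDate(value) {
  const pattern =
    /^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?(?:([Zz+-])(?:(\d{2})'?(?:(\d{2})'?)?)?)?$/;
  const m = pattern.exec(value.trim());
  if (!m) return null;

  // Omitted trailing fields fall back to their defaults.
  const [, year, month = '01', day = '01', hour = '00', minute = '00',
    second = '00', sign = 'Z', offH = '00', offM = '00'] = m;

  // Read the wall-clock fields as if they were UTC, then correct by the offset.
  const wallClockMs = Date.UTC(+year, +month - 1, +day, +hour, +minute, +second);
  const isUtc = sign === 'Z' || sign === 'z';
  const offsetMinutes = isUtc ? 0 : (sign === '-' ? -1 : 1) * (+offH * 60 + +offM);

  return { date: new Date(wallClockMs - offsetMinutes * 60000), offsetMinutes };
}

const parsed = parsePdfDate("D:20260531142233-05'00'");
console.log(parsed.date.toISOString()); // 2026-05-31T19:22:33.000Z
console.log(parsed.offsetMinutes);      // -300, i.e. the author was at UTC-5
```

The offset is worth noticing on its own: even without the clock time, it narrows down where the file was produced.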

Two more failure modes that surprise people:

Why it happens

The format was designed for faithful reproduction and provenance, not privacy. Three distinct things people conflate:

| Thing | Where it lives | Visible on page? | Removed by |
|---|---|---|---|
| Info dictionary | Trailer /Info object | No | Rewriting/clearing the dict |
| XMP metadata | A metadata stream attached to the document catalog | No | Deleting the metadata stream |
| Redaction “boxes” | Drawn content on the page | Yes (the box) | Actually deleting the underlying text/objects, not covering it |

The Info dictionary is defined in the PDF spec’s document information section; XMP is a separate Adobe/ISO standard (a chunk of RDF/XML embedded in the file). A tool that only edits one leaves the other intact — which is why a file can show “no author” in one viewer and still carry the author in its XMP packet.
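
As a rough illustration of the two stores being independent, the sketch below scans the raw bytes for each one separately. The function name findMetadataStores is hypothetical, and the check is a heuristic: it assumes the trailer and catalog are stored uncompressed, so a file that uses cross-reference or object streams can produce false negatives for the two reference checks. XMP packets are normally stored as plain text, which is why a byte scan tends to find them.

```javascript
// Hypothetical heuristic: report which metadata stores are visible in the raw bytes.
// `bytes` is a Uint8Array holding the whole PDF file.
function findMetadataStores(bytes) {
  // Latin-1 decoding maps every byte to one character, so binary data cannot
  // break the decode; we only search for ASCII patterns anyway.
  const text = new TextDecoder('latin1').decode(bytes);
  return {
    // Trailer entry pointing at the Info dictionary, e.g. "/Info 2 0 R"
    infoRef: /\/Info\s+\d+\s+\d+\s+R/.test(text),
    // Catalog entry pointing at the metadata stream, e.g. "/Metadata 3 0 R"
    catalogMetadataRef: /\/Metadata\s+\d+\s+\d+\s+R/.test(text),
    // The XMP packet itself
    xmpPacket: text.includes('<x:xmpmeta') || text.includes('<?xpacket'),
  };
}
```

A result with infoRef false and xmpPacket true is the "no author in one viewer, author still in the file" case described above.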

What to do

To strip metadata (Info + XMP), rewrite the document without them. With pdf-lib this is a few lines and runs in the browser — the bytes never leave the page:

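// npm i pdf-lib
import { PDFDocument, PDFName, PDFRef } from 'pdf-lib';

// `file` is a File from an <input type="file"> or a drop event
const bytes = await file.arrayBuffer();
// updateMetadata: false stops pdf-lib from stamping its own Producer and ModDate on load
const pdf = await PDFDocument.load(bytes, { updateMetadata: false });

// Blank the Info dictionary's text fields
pdf.setTitle('');
pdf.setAuthor('');
pdf.setSubject('');
pdf.setKeywords([]);
pdf.setProducer('');
pdf.setCreator('');
// The setters above do not touch the timestamps, so overwrite them with a fixed value
pdf.setCreationDate(new Date(0));
pdf.setModificationDate(new Date(0));

// Drop the XMP metadata stream as well. Removing only the catalog entry would
// leave the stream in the saved file as an orphaned object, so delete the object too.
const xmpRef = pdf.catalog.get(PDFName.of('Metadata'));
if (xmpRef instanceof PDFRef) pdf.context.delete(xmpRef);
pdf.catalog.delete(PDFName.of('Metadata'));

const cleaned = await pdf.save({ useObjectStreams: true });

That blanks the Info text fields, resets both timestamps to the Unix epoch, and removes the XMP stream object itself rather than just the pointer to it. pdf-lib writes a complete new file on save instead of appending an incremental update, so earlier revisions of changed objects are not carried over.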

To verify, read the metadata back out. ExifTool reports both Info and XMP tags:

exiftool -a -G1 cleaned.pdf

If a field is truly gone it won’t appear, and a field that was only blanked may still be listed with an empty value. If your tool only touched the Info dict, ExifTool will still print the XMP group.
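
A complementary check you can run in the browser is to search the cleaned bytes for strings you know were in the original, such as your username. The sketch below is one way to do that; findLeaks and indexOfBytes are hypothetical helper names. It looks for each needle as UTF-8 (which also covers plain ASCII strings and XMP) and as raw UTF-16BE (a common encoding for PDF text strings). It cannot see inside compressed streams or hex-encoded strings, so an empty result is a good sign and not proof.

```javascript
// Hypothetical helper: first index of `needle` inside `haystack` (both Uint8Array), or -1.
function indexOfBytes(haystack, needle) {
  outer: for (let i = 0; i + needle.length <= haystack.length; i++) {
    for (let j = 0; j < needle.length; j++) {
      if (haystack[i + j] !== needle[j]) continue outer;
    }
    return i;
  }
  return -1;
}

// Hypothetical helper: report which known-sensitive strings still occur in the file bytes.
function findLeaks(bytes, needles) {
  const hits = [];
  for (const needle of needles) {
    if (!needle) continue; // an empty needle would match everywhere

    const utf8 = new TextEncoder().encode(needle);

    // UTF-16BE: high byte first, then low byte, for each UTF-16 code unit.
    const utf16be = new Uint8Array(needle.length * 2);
    for (let i = 0; i < needle.length; i++) {
      const unit = needle.charCodeAt(i);
      utf16be[2 * i] = unit >> 8;
      utf16be[2 * i + 1] = unit & 0xff;
    }

    if (indexOfBytes(bytes, utf8) !== -1) hits.push({ needle, encoding: 'utf-8' });
    if (indexOfBytes(bytes, utf16be) !== -1) hits.push({ needle, encoding: 'utf-16be' });
  }
  return hits;
}
```

For example, findLeaks(new Uint8Array(cleaned), ['jsmith']) should come back empty for a file whose only copy of the name was in the Info dictionary or the XMP packet.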

To actually redact text, you must remove the underlying objects, not cover them. A black box is a drawing, not a deletion. If the words matter, delete the text from the content stream (or rasterize the page and re-OCR only what you want kept) — then confirm with copy-paste and a text-extraction pass that the words are gone.
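
The difference between covering and deleting is easy to see on a toy content stream. The sketch below is illustrative only: the stream is hand-written and uncompressed, and extractShownText is a hypothetical, deliberately naive extractor that only understands literal strings followed by the Tj operator. Real pages are usually compressed and use more operators and font encodings, so use a proper text-extraction tool for actual verification.

```javascript
// A hand-written page content stream: show some text, then paint a black box on top.
const drawText = 'BT /F1 12 Tf 72 700 Td (Account 4111-1111) Tj ET';
const drawBox = '0 0 0 rg 70 696 130 14 re f'; // black fill colour, rectangle, fill

const covered = [drawText, drawBox].join('\n'); // "redacted" by covering
const deleted = [drawBox].join('\n');           // text operators actually removed

// Naive extractor: collect every literal string that is shown with Tj.
function extractShownText(stream) {
  const found = [];
  const showText = /\(((?:\\.|[^\\()])*)\)\s*Tj/g;
  let match;
  while ((match = showText.exec(stream)) !== null) {
    found.push(match[1]);
  }
  return found;
}

console.log(extractShownText(covered)); // still contains 'Account 4111-1111'
console.log(extractShownText(deleted)); // empty array
```

The box changes what a viewer paints, not what the stream contains, which is why copy-paste and text extraction still return the covered words.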

If you’d rather not write code, bytefork’s PDF metadata stripper does the Info-and-XMP rewrite locally — drag a file in, download the cleaned copy, nothing is uploaded.

Caveats
