PDF without Adobe: 15 macOS can sanitize your PDFs

I’ve been drawing attention to the danger of hidden and orphaned content being left in PDF documents, something which frequently catches people out when they release those documents, only for someone to discover embarrassing secrets left within the files.

Two techniques are available to sanitize PDF documents. As with most document formats, using the Save As command usually forces the app to write all the data out afresh. More specifically for PDF, some apps (including both PDF Expert and PDFpenPro) offer commands to write ‘flattened’ versions of the document.

The greatest difficulty facing even the expert user is that of checking whether a PDF document has been sanitized, or still has sensitive information remaining in it. I showed how even Adobe Acrobat (Pro) DC’s structure browser is based on what is listed by the document Catalog, and can readily miss orphaned data left in a previously edited file.

I have been adding features to Podofyllin to help the user check for orphaned content, and a couple of days ago had modified its source code to display the PDF file in ASCII format in the app’s Source window, when I discovered something very odd. So odd that I thought my file system had broken.

There are two ways of accessing the raw data in a PDF document. One is to take the data exactly as read from the file; the other is to call an instance method of the PDFDocument class, dataRepresentation(). Checking the minimalist documentation provided by Apple, I thought that the latter might reduce memory usage, so I was calling that and converting the data into ASCII format to display in the view.
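
Those two routes can be sketched in Swift as follows. This is a minimal illustration rather than Podofyllin’s actual source: the function names onDiskData and pdfKitData are invented for the example, and the only real API used is Data(contentsOf:), PDFDocument(url:) and dataRepresentation().

```swift
import Foundation
import PDFKit

/// Route 1: the bytes exactly as they are stored in the file.
/// (Hypothetical helper name, for illustration only.)
func onDiskData(of url: URL) throws -> Data {
    try Data(contentsOf: url)
}

/// Route 2: the bytes PDFKit hands back for the document it has opened.
/// Returns nil if the file can't be opened as a PDF, or if PDFKit
/// can't produce a data representation for it.
/// (Hypothetical helper name, for illustration only.)
func pdfKitData(of url: URL) -> Data? {
    PDFDocument(url: url)?.dataRepresentation()
}
```

Calling both on the same file URL and comparing the byte counts is a quick first check of whether the two routes really return the same thing.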

To check that this worked properly, I opened some of my test files, which included some containing orphaned objects, only for those objects to vanish from the PDF source. For about half an hour I panicked. I couldn’t understand why opening those same PDF files in BBEdit showed that the orphaned objects were still present, but in Podofyllin they were not only gone, but the whole file appeared different. I thought that my file system was broken, and was opening two quite different files.

The explanation is that PDFDocument.dataRepresentation() doesn’t show the raw data in the original PDF file at all. When macOS opens the file and turns it into a PDFDocument, it effectively re-images the whole PDF, and what you access in PDFDocument.dataRepresentation() is the PDF of that new document image. This strips out orphaned content very effectively: if an object doesn’t get drawn into the fresh image of the PDF, then it doesn’t survive into the data returned by PDFDocument.dataRepresentation().

I have checked Apple’s old documentation, now archived, and even the sole book about Quartz 2D and PDF in macOS, which is now out of print but still available electronically in Apple’s Book Store. Nowhere is there any mention of this, and I have searched in vain for further information on the Internet.

The next step was to open the PDF source both ways: as read from the original file, and as returned by PDFDocument.dataRepresentation(). Sure enough, PDFKit is effectively sanitizing each PDF document which it opens. I believe that this is what is used by both PDF Expert and PDFpenPro to ‘flatten’ PDF documents, and should be a robust way of sanitizing the great majority of PDFs. (It may well be possible to produce deliberately obfuscated PDFs which bypass this, but for non-malicious documents this should suffice.)
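
If that holds, sanitizing a PDF could be as simple as opening it with PDFKit and writing out whatever dataRepresentation() returns. The Swift sketch below assumes as much. The names exportRegeneratedPDF, RegenerateError and dataContains are hypothetical, and I don’t know whether this is how PDF Expert or PDFpenPro actually implement flattening.

```swift
import Foundation
import PDFKit

/// Hypothetical error type for this sketch.
enum RegenerateError: Error {
    case cannotOpen   // PDFKit could not open the file as a PDF
    case noData       // PDFKit returned no data representation
}

/// Opens a PDF with PDFKit and writes the regenerated bytes to a new file,
/// leaving the original untouched.
func exportRegeneratedPDF(from source: URL, to destination: URL) throws {
    guard let document = PDFDocument(url: source) else {
        throw RegenerateError.cannotOpen
    }
    guard let data = document.dataRepresentation() else {
        throw RegenerateError.noData
    }
    try data.write(to: destination, options: .atomic)
}

/// Crude check: does this byte sequence appear anywhere in the data?
/// It only matches content stored uncompressed in the file.
func dataContains(_ marker: String, in data: Data) -> Bool {
    data.range(of: Data(marker.utf8)) != nil
}
```

A test file with a known string in an orphaned object can then be checked before and after export using dataContains. Because most stream content in PDFs is compressed, failing to find a string this way doesn’t prove that it has gone, so treat it as a quick check rather than a guarantee.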

This calls for a new version of Podofyllin which:

  • displays both on-disk and re-imaged source in the Source window; these are truncated to 500 KB at present;
  • offers a command to export a document in ‘flattened’ PDF form;
  • offers a command to export the contents of the text window to a text file;
  • explains to the user how to use these to sanitize PDFs.
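
The truncated display in the first item might look something like this sketch. The name sourcePreview and the exact limit are assumptions: ‘500 KB’ could mean 500,000 or 512,000 bytes, and I’ve picked the former here. It also uses ISO Latin-1 rather than strict ASCII, because Latin-1 assigns a character to every byte value, so binary stream data can’t make the conversion fail.

```swift
import Foundation

/// Assumed cap on the amount of source shown, in bytes.
let previewLimit = 500_000

/// Converts at most `limit` leading bytes of PDF data into text for display.
/// (Hypothetical helper name, for illustration only.)
func sourcePreview(of data: Data, limit: Int = previewLimit) -> String {
    let head = data.prefix(limit)
    // Latin-1 decoding should always succeed; fall back to an empty
    // string rather than force-unwrapping.
    return String(data: head, encoding: .isoLatin1) ?? ""
}
```

The same function can be applied to both the on-disk data and the re-imaged data, so that the two can be shown one after the other.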

These are now incorporated into version 1.0b6 of Podofyllin, available from here: podofyllin10b6 and from Downloads above.

Although the whole PDF source can be very interesting, what I am looking at next is parsing it to help the user spot orphaned objects and other problems, maybe even to enable repair of some damaged PDF files.