OCR Those PDFs

[Image: reCAPTCHA]

Several years back, I more or less stopped making photocopies. In part, my ability to stop adding to the pile of dead tree flakes in my office came about when I moved my class communications online; instead of handing out syllabi or other handouts, I put electronic versions of those documents on our class website.

But the most important factor in my all-but-copy-free workstyle was my department’s lease of a new copier with a powerful high-speed scanner and a network connection. Now, instead of photocopying that chapter I need to read, I scan it and have the machine automatically send it to my email address.

Which is fantastic, of course; now I have those pages that I need to annotate in a highly portable digital format. The only problem is that the PDFs that our copier makes are actually pictures of the pages, rather than text-containing documents. As a result, not only are the resulting files extremely large (and thus not easily emailable), they’re also not searchable, and they can’t be digitally annotated using many common desktop tools, such as Mendeley, which Julie wrote about yesterday.

This situation didn’t bother me that much, though, until I started using iAnnotate on my iPad; suddenly the inability to highlight, underline, and search my PDF library became really annoying. So I’ve set about making those scans searchable and annotatable by running them through OCR.

OCR, or optical character recognition, is a process in which a computer examines the pattern of pixels in an image and tries to match it to letter forms. This translation of the image of text into actual machine-encoded text is the necessary first step in making scanned-in pages annotatable.
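To make the idea of matching pixels to letter forms a bit more concrete, here is a deliberately tiny sketch in Python. It is a toy, not how any real OCR engine works: the three 5-by-3 letter bitmaps, the `distance` and `recognize` functions, and the noisy sample are all invented for illustration. It simply picks the stored letter whose pixels differ least from the scanned shape.

```python
# Toy illustration only: real OCR engines are far more sophisticated.
# Each glyph is a 5-row by 3-column bitmap; '#' is ink, '.' is blank.
TEMPLATES = {
    "I": ("###", ".#.", ".#.", ".#.", "###"),
    "L": ("#..", "#..", "#..", "#..", "###"),
    "T": ("###", ".#.", ".#.", ".#.", ".#."),
}


def distance(a, b):
    """Count the pixels at which two same-sized bitmaps differ."""
    return sum(
        pixel_a != pixel_b
        for row_a, row_b in zip(a, b)
        for pixel_a, pixel_b in zip(row_a, row_b)
    )


def recognize(glyph):
    """Return the template letter whose bitmap is closest to the glyph."""
    return min(TEMPLATES, key=lambda letter: distance(glyph, TEMPLATES[letter]))


# A scanned "T" with one stray pixel of noise beside the stem.
noisy_t = ("###", ".#.", ".#.", ".##", ".#.")
print(recognize(noisy_t))
```

With only one stray pixel, the closest template is still the right one; pile on enough smudges and a different letter starts to look like the better match.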

OCR remains somewhat problematic, even after the more than 55 years since the first commercial OCR systems were released. Though most OCR software is capable of dealing quite well with most common typefaces, poor quality copies, distorted text, stray marks, or even things like ligatures (the conjoined “fl,” for instance) are enough to throw the process off. Things have gotten better, however, in no small part due to crowd-sourced training. The image that leads off this post shows one aspect of this training: the reCAPTCHA system, used to ensure that comments online are being left by real humans, works by asking you to input one known word and one troublesome word from a scanned text. Your human OCR thus helps to train machine OCR.
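For the curious, the known-word-plus-troublesome-word idea can be sketched in a few lines of Python. This is only my guess at the general shape of such a scheme, not reCAPTCHA's actual implementation: the `votes` store, the `submit` and `consensus` functions, and the three-vote threshold are all assumptions made up for the example.

```python
from collections import Counter, defaultdict

# Hypothetical store: id of a troublesome word image -> tally of human readings.
votes = defaultdict(Counter)


def submit(control_answer, control_truth, unknown_id, unknown_answer):
    """Pass the user if the known word matches; if so, record their other reading."""
    passed = control_answer.strip().lower() == control_truth.lower()
    if passed:
        votes[unknown_id][unknown_answer.strip().lower()] += 1
    return passed


def consensus(unknown_id, min_votes=3):
    """Return the most common reading once it has enough votes, else None."""
    if not votes[unknown_id]:
        return None
    word, count = votes[unknown_id].most_common(1)[0]
    return word if count >= min_votes else None


# Three people read the same smudged word; a fourth fails the known word.
submit("morning", "morning", "scan-17", "flourish")
submit("Morning", "morning", "scan-17", "flourish")
submit("morning", "morning", "scan-17", "Flourish ")
submit("mourning", "morning", "scan-17", "nourish")
print(consensus("scan-17"))
```

The design point is that the known word does the gatekeeping, so only readings from people who got it right count toward the troublesome word.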

Despite these improvements, if you’re using OCR in order to do any text mining, or to create an authoritative digital edition of a text, you’re likely to have to do a bit of correction. For my purposes, however, OCR works quite well; the vast majority of key terms for which I would want to search scan fine, and to create highlightable text, all I really need OCR to do is figure out where the text is.

There are several different ways to go about the process of OCRing your PDFs. Many scanners allow you to run OCR as you do your scans in the first place; alas, our department copier doesn’t (which is part of why it’s so fast). Moreover, some information organizers, including DEVONthink Pro Office, as Ryan wrote about a while back, will automatically OCR your PDFs. DEVONthink Pro Office, in fact, comes packaged with the ABBYY FineReader engine, which represents more or less the state of the art in contemporary consumer OCR software.

What I was interested in, however, wasn’t organizing my PDFs—I have a filing system that works quite well. I just wanted them to be searchable and annotatable. Adobe Acrobat Pro handles this task quite well, it turns out. Admittedly, Acrobat can be pricey; I certainly wouldn’t have paid for it separately, but it came as part of the Adobe CS5 suite that I recently picked up at a discounted academic price. In Acrobat, by selecting the Document > OCR Text Recognition > Recognize Text Using OCR menu item, I’m able to extract the text from any set of PDF images. The result is a file that can be searched and marked up using standard PDF tools.

Better still, Acrobat 9 makes use of the ClearScan OCR engine, which produces smaller files. By using the “Save As” command to overwrite the existing file (rather than simply saving), I’ve managed to get just about every PDF that I’ve OCR’d in Acrobat to take up about half the space it previously did, something I’d never have expected from a software package that I was accustomed to finding bloated and unwieldy.

And even better than that, just as I resigned myself to a painful process of opening a PDF, running OCR, doing a Save As command, and closing the PDF—repeat ad nauseam—I discovered a freely downloadable AppleScript droplet for batch OCRing.
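If you don't have Acrobat or a Mac, the same open, OCR, save, close loop can be sketched in Python. To be clear, this is not the AppleScript droplet, and it swaps in a different engine: I'm assuming the open-source `ocrmypdf` command-line tool is installed and using only its basic `ocrmypdf input.pdf output.pdf` form. The function names `ocr_with_cli` and `batch_ocr` and the `.ocr.tmp.pdf` suffix are invented for this example.

```python
import subprocess
from pathlib import Path

TMP_SUFFIX = ".ocr.tmp.pdf"  # hypothetical marker for in-progress output files


def ocr_with_cli(src, dst):
    """Assumed backend: the ocrmypdf CLI, called as `ocrmypdf in.pdf out.pdf`."""
    subprocess.run(["ocrmypdf", str(src), str(dst)], check=True)


def batch_ocr(folder, ocr=ocr_with_cli):
    """OCR every PDF directly inside folder, replacing each file only on success."""
    done = []
    for src in sorted(Path(folder).glob("*.pdf")):
        if src.name.endswith(TMP_SUFFIX):
            continue  # skip leftovers from an interrupted earlier run
        tmp = src.with_name(src.stem + TMP_SUFFIX)
        try:
            ocr(src, tmp)
        except Exception:
            tmp.unlink(missing_ok=True)  # leave the original scan untouched
            continue
        tmp.replace(src)  # overwrite the original, like the "Save As" step
        done.append(src.name)
    return done
```

Calling `batch_ocr("scans")` on a folder of your own (the folder name here is just a placeholder) returns the names of the files it managed to convert. Writing to a temporary file first means a failed run never clobbers the original scan, which matters when you are overwriting in place. Try it on copies before trusting it with your only scans.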

So I’m working, a small batch at a time, on making all of my image-based PDFs machine-readable, and am really happy to find that, in the age of the tablet, an already useful file format is becoming even more useful to me.

But what about you? Do you have a way of processing your PDFs that renders them more usable? Share your suggestions in the comments!

[Image by Flickr user Vitor Lima; Creative Commons licensed.]
