ProfHacker

Teaching, tech, and productivity.

OCR Those PDFs

By Kathleen Fitzpatrick July 20, 2010

[Image: reCAPTCHA]

Several years back, I more or less stopped making photocopies. In part, my ability to stop adding to the pile of dead tree flakes in my office came about when I moved my class communications online; instead of handing out syllabi or other handouts, I put electronic versions of those documents on our class website.

But the most important factor in my all-but-copy-free workstyle was my department’s lease of a new copier with a powerful high-speed scanner and a network connection. Now, instead of photocopying that chapter I need to read, I scan it and have the machine automatically send it to my email address.

Which is fantastic, of course; now I have those pages that I need to annotate in a highly portable digital format. The only problem is that the PDFs that our copier makes are actually pictures of the pages, rather than text-containing documents. As a result, not only are the resulting files extremely large (and thus not easily emailable), they’re also not searchable, and they can’t be digitally annotated using many common desktop tools, such as Mendeley, which Julie wrote about yesterday.

This situation didn’t bother me that much, though, until I started using iAnnotate on my iPad; suddenly the inability to highlight, underline, and search my PDF library became really annoying. So I’ve set about making those scans searchable and annotatable by running them through OCR.

OCR, or optical character recognition, is a process in which a computer examines the pattern of pixels in an image and tries to identify letter forms. This translation of an image of text into actual machine-encoded text is the necessary first step in making scanned-in pages searchable and annotatable.
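To make the idea of matching pixel patterns to letter forms concrete, here is a toy sketch in Python. It is not how any real OCR engine works; the tiny 3x5 letter templates, the fixed glyph width, and the function names are all assumptions made up for illustration. Each cell of the "scan" is compared against every template, and the letter with the fewest differing pixels wins.

```python
# Toy illustration only: real OCR engines are far more sophisticated.
# Each glyph is a 3x5 bitmap written as strings, '#' for ink and '.' for blank.
TEMPLATES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
    "O": ["###", "#.#", "#.#", "#.#", "###"],
}


def pixel_distance(a, b):
    """Count the pixels at which two same-sized bitmaps differ."""
    return sum(pa != pb for row_a, row_b in zip(a, b) for pa, pb in zip(row_a, row_b))


def recognize_glyph(bitmap):
    """Return the template letter whose pixels differ least from the bitmap."""
    return min(TEMPLATES, key=lambda letter: pixel_distance(bitmap, TEMPLATES[letter]))


def recognize_line(rows, glyph_width=3, gap=1):
    """Split a one-line image into fixed-width cells and recognize each cell."""
    step = glyph_width + gap
    letters = []
    for start in range(0, len(rows[0]), step):
        cell = [row[start:start + glyph_width] for row in rows]
        letters.append(recognize_glyph(cell))
    return "".join(letters)


# A "scanned" line reading TOIL, with one stray mark inside the O.
scanned = [
    "###.###.###.#..",
    ".#..#.#..#..#..",
    ".#..###..#..#..",
    ".#..#.#..#..#..",
    ".#..###.###.###",
]
print(recognize_line(scanned))
```

The stray mark in the second letter costs only one pixel of difference, so the closest template is still the O. A few more stray pixels could tip the match toward the wrong letter, which is the kind of failure that poor-quality copies tend to produce.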

OCR remains somewhat problematic, even after the more than 55 years since the first commercial OCR systems were released. Though most OCR software is capable of dealing quite well with most common typefaces, poor quality copies, distorted text, stray marks, or even things like ligatures (the conjoined “fl,” for instance) are enough to throw the process off. Things have gotten better, however, in no small part due to crowd-sourced training. The image that leads off this post shows one aspect of this training: the reCAPTCHA system, used to ensure that comments online are being left by real humans, works by asking you to input one known word and one troublesome word from a scanned text. Your human OCR thus helps to train machine OCR.
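The known-word-plus-troublesome-word idea can be sketched roughly as follows. This is only a guess at the general logic, not reCAPTCHA's actual implementation: the function names, the sample words, and the three-vote threshold are hypothetical. A response counts only if the known word is typed correctly, and in that case the typed reading of the troublesome word is recorded as one vote.

```python
from collections import Counter


def check_response(known_answer, typed_known, typed_unknown, votes):
    """Pass the user only if the known word matches; if it does, record
    their reading of the troublesome word as one vote."""
    if typed_known.strip().lower() != known_answer.lower():
        return False
    votes[typed_unknown.strip().lower()] += 1
    return True


def agreed_reading(votes, min_votes=3):
    """Return the most common reading once it has at least min_votes
    (a hypothetical threshold), otherwise None."""
    if not votes:
        return None
    word, count = votes.most_common(1)[0]
    return word if count >= min_votes else None


votes = Counter()
# (what the user typed for the known word, what they typed for the unknown word)
responses = [
    ("morning", "flour"),
    ("morning", "flour"),
    ("mourning", "floor"),  # fails the known word, so this reading is ignored
    ("morning", "flour"),
]
for typed_known, typed_unknown in responses:
    check_response("morning", typed_known, typed_unknown, votes)

print(agreed_reading(votes))
```

Because the third response gets the known word wrong, its reading of the troublesome word never enters the tally, and the three matching readings settle the word.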

Despite these improvements, if you’re using OCR in order to do any text mining, or to create an authoritative digital edition of a text, you’re likely to have to do a bit of correction. For my purposes, however, OCR works quite well; the vast majority of key terms for which I would want to search scan fine, and to create highlightable text, all I really need OCR to do is figure out where the text is.
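For the occasional key term that does not scan cleanly, a search can be made a little more forgiving. The sketch below uses Python's standard `difflib` module to find words in OCR'd text that are close to a search term; the sample sentence, the `fuzzy_find` name, and the 0.8 similarity cutoff are assumptions chosen for illustration, and the extracted text is assumed to be available already as a plain string.

```python
import difflib
import re


def fuzzy_find(term, ocr_text, cutoff=0.8):
    """Return words in ocr_text that closely resemble term, best match first."""
    words = sorted(set(re.findall(r"[a-z0-9]+", ocr_text.lower())))
    return difflib.get_close_matches(term.lower(), words, n=5, cutoff=cutoff)


# "hegemony" misread by OCR, with the "m" recognized as "rn".
ocr_text = "The hegernony of print culture shaped the modern university."
print(fuzzy_find("hegemony", ocr_text))
```

A lower cutoff catches more misreadings but also returns more unrelated words, so the right value depends on how noisy the scans are.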

There are several different ways to go about the process of OCRing your PDFs. Many scanners allow you to run OCR as you do your scans in the first place; alas, our department copier doesn’t (which is part of why it’s so fast). Moreover, some information organizers, including DEVONthink Pro Office, as Ryan wrote about a while back, will automatically OCR your PDFs. DEVONthink Pro Office, in fact, comes packaged with the ABBYY FineReader engine, which represents more or less the state of the art in contemporary consumer OCR software.

What I was interested in, however, wasn’t organizing my PDFs—I have a filing system that works quite well. I just wanted them to be searchable and annotatable. Adobe Acrobat Pro handles this task quite well, it turns out. Admittedly, Acrobat can be pricey; I certainly wouldn’t have paid for it separately, but it came as part of the Adobe CS5 suite that I recently picked up at a discounted academic price. In Acrobat, by selecting the Document > OCR Text Recognition > Recognize Text Using OCR menu item, I’m able to extract the text from any set of PDF images. The result is a file that can be searched and marked up using standard PDF tools.

Better still, Acrobat 9 makes use of the ClearScan OCR engine, which produces smaller files. By using the “Save As” command to overwrite the existing file (rather than simply saving), I’ve managed to get just about every PDF that I’ve OCR’d in Acrobat to take up about half the space it previously did, something I’d never have expected from a software package that I was accustomed to finding bloated and unwieldy.

And even better than that, just as I resigned myself to a painful process of opening a PDF, running OCR, doing a Save As command, and closing the PDF—repeat ad nauseam—I discovered a freely downloadable AppleScript droplet for batch OCRing.
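For readers without Acrobat or a Mac, the same open, OCR, save, repeat loop can be approximated with a short script. The sketch below is not the AppleScript droplet mentioned above; it assumes the separate open-source `ocrmypdf` command-line tool is installed, and the `_ocr` filename suffix, the function names, and the batch size of ten are arbitrary choices for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def plan_batch(folder, suffix="_ocr"):
    """Pair each PDF in folder with an output path, skipping files that
    are themselves outputs or that already have an output next to them."""
    jobs = []
    for src in sorted(Path(folder).glob("*.pdf")):
        if src.stem.endswith(suffix):
            continue  # produced by an earlier run
        dst = src.with_name(src.stem + suffix + ".pdf")
        if not dst.exists():
            jobs.append((src, dst))
    return jobs


def run_batch(folder, batch_size=10):
    """OCR at most batch_size PDFs by calling the ocrmypdf command-line tool
    (assumed to be installed; it is not part of Python)."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not installed or not on PATH")
    for src, dst in plan_batch(folder)[:batch_size]:
        # Basic invocation: ocrmypdf INPUT.pdf OUTPUT.pdf
        subprocess.run(["ocrmypdf", str(src), str(dst)], check=True)
```

Calling `run_batch("scans")` on a hypothetical `scans` folder would process up to ten files and leave the originals untouched, so a later run simply picks up where the last one stopped.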

So I’m working, a small batch at a time, on making all of my image-based PDFs machine-readable, and am really happy to find that, in the age of the tablet, an already useful file format is becoming even more useful to me.

But what about you? Do you have a way of processing your PDFs that renders them more usable? Share your suggestions in the comments!

[Image by Flickr user Vitor Lima; Creative Commons licensed.]
