Google can now OCR all PDFs
When you scan a document, your computer interprets this data as an image. You can see the words on the screen, but your computer doesn’t. As far as your computer is concerned, the letters could be birds or your child or a boat.
When you put a scan like this up on a website, search engines couldn’t index any of its content because they didn’t recognize the text as text … until now.
Google has a new system that scans Acrobat PDFs on the web for words using Optical Character Recognition (OCR). Just as it already indexes the words in PDFs that have been OCR processed, it will now do the same for scanned documents posted online that haven’t yet undergone OCR.

If you have scanned PDFs and are interested in having them converted into text, you can upload the images to your website and take advantage of this service.
Simply follow the instructions for how to use Google OCR from the Digital Inspiration website:
Once done, type the query “site:abc.com/pdf filetype:pdf” [into Google] to see the PDF documents as HTML.
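If you check several sites or folders, you can build that search link with a few lines of Python instead of typing it each time. This is only a minimal sketch: the function name build_pdf_search_url is made up for illustration, and abc.com/pdf is the same placeholder address used in the query above.

```python
from urllib.parse import urlencode


def build_pdf_search_url(site_path: str) -> str:
    """Return a Google search URL limited to PDFs under site_path.

    The function name and the example path are illustrative assumptions.
    """
    # Same query as in the text: restrict by site/folder and by file type.
    query = f"site:{site_path} filetype:pdf"
    # urlencode percent-escapes ':' and '/' and turns the space into '+'.
    return "https://www.google.com/search?" + urlencode({"q": query})


print(build_pdf_search_url("abc.com/pdf"))
```

Open the printed link in your browser. Results will only appear once Google has crawled the files.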
Lifehacker recommends using “Google’s Webmaster Tools to reign in what gets scanned and indexed on your site, although you should assume anything you put online can be found by those looking for it.”
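Webmaster Tools is one way to do that. Another common approach, not the one Lifehacker describes, is a robots.txt file that asks crawlers to skip certain folders. The sketch below uses Python's standard urllib.robotparser to check which addresses such a file would block. The /private/ folder, the file names, and the is_crawlable helper are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: ask every crawler to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_crawlable(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if these robots.txt rules allow agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# A scan in the public folder is allowed; one under /private/ is not.
print(is_crawlable(ROBOTS_TXT, "http://abc.com/pdf/manual.pdf"))
print(is_crawlable(ROBOTS_TXT, "http://abc.com/private/tax-return.pdf"))
```

Keep in mind that robots.txt is only a request to well-behaved crawlers. It does not stop anyone who has the address from opening the file, so truly private documents should not be on a public website at all.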
This is a really terrific way to get rid of paper clutter in your workspace and in your home, since the words in your scanned documents are now searchable.

12 comments posted
Posted by sammy - 11/08/2008
Um… I guess it goes without saying that you shouldn’t do this with documents you want kept private? For instance, I save my tax returns as PDFs.
Posted by Geekmoose - 11/08/2008
Mac OS X will also OCR PDF files so you can search them using Spotlight. It has done so since 10.3 at least!
Posted by Tanja - 11/08/2008
Geekmoose: As far as I know, Mac OS X does not do OCR (I tried making it do it, but my PNGs converted to PDF never got OCR’d). What Mac OS X does do is index PDF files that have the text embedded. For instance, if you print something from, say, Word or a website, and you select the ‘Save as PDF’ option, it will embed the actual text.
Spotlight can read this, and this makes the PDF searchable. A PDF containing only an image (thus having just one layer) never gets indexed. I’m a bit fuzzy on the details of how PDFs work exactly, but I think they have one or more layers, one of them always being an exact image of your scan / print, and others possibly the exact text and an index and such.
It could be my systems, though it happens on three different computers which makes me think it just works like that.
Sorry about the long comment
Great tip for public info!
Posted by Luke Gedeon - 11/08/2008
Geekmoose,
Does that work for jpg too? I have gigs of docs scanned to jpg. Does it really OCR or just index the text stored in PDFs? PDFs can store formatted text and images both.
All and Anyone,
Anyone know of a free option to OCR gigs of JPGs without making them public for Google to process?
Posted by Zora - 11/08/2008
I’ve spent 5 years volunteering at Distributed Proofreaders, where we *correct* OCR of images in order to make free ebooks. Uncorrected OCR can be grossly inaccurate. If it’s based on plain text, recent edition, high-quality scan, crisp print, it will probably turn out OK. If it’s in any way unusual, oddly formatted, old, or low-quality, you’ll get gibberish. In either case, OCR may give you what we at DP call “scannos”: “be” for “he”, “arid” for “and”, “clown” for “down”, etc. Until we have strong AI, only the alert human eye will catch the errors.
Be warned.
Posted by Fit Bottomed Girls - 11/08/2008
I heart Google. They’re amazing.
Posted by Jack - 11/09/2008
Luke, you might want to try Evernote. They do OCR for images, are private and can be kept entirely on your computer if you want, and have the best recognition I’ve come across. It’s where I save my product receipts, for example.
Posted by jon - 11/09/2008
Sammy, it does not go without saying, and I think the original article should be clearer. If you scan personal documents to store online for your own easy access, you are now making them easily accessible to anybody else, unless you take steps to protect them. The warning should come before the instructions and the warning should be much more explicit and bolder.
Posted by DanGTD - 11/12/2008
Nice tool, but having to wait for the spiders will consume time. I wish Scribd would add this and convert in real time to text.
Posted by Molly - 11/12/2008
Ooh, I just tripped over another way! It may not be great for searching, or for transforming entire documents, but if you upload clear PDFs to Google Docs, selected text can be copied and pasted into another document. This’ll be dead useful for me writing research papers, and probably save a lot of retyping time for others.
Posted by Steve Hannah - 11/01/2009
Cool tip. If you’re using Mac OS X you can use PDF OCR X to convert your PDFs to text also. It uses Google’s Tesseract OCR engine for its conversions so it should produce more or less the same results as Google.
http://solutions.weblite.ca/pdfocrx
Posted by Système D » The mysterious data mines of Argleton-on-Google - 11/03/2009
[...] OCRed address, meant to be “Aughton” but transcribed as “Argleton”. We already know that Google is OCRing PDFs as it crawls them; or maybe it was OCRed before being uploaded to the web. No [...]