<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Google can now OCR all PDFs</title>
	<atom:link href="http://unclutterer.com/2008/11/08/google-can-now-ocr-all-pdfs/feed/" rel="self" type="application/rss+xml" />
	<link>http://unclutterer.com/2008/11/08/google-can-now-ocr-all-pdfs/</link>
	<description>Daily tips on how to organize your home and office.</description>
	<lastBuildDate>Tue, 16 Mar 2010 05:16:42 -0400</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Système D &#187; The mysterious data mines of Argleton-on-Google</title>
		<link>http://unclutterer.com/2008/11/08/google-can-now-ocr-all-pdfs/comment-page-1/#comment-45060</link>
		<dc:creator>Système D &#187; The mysterious data mines of Argleton-on-Google</dc:creator>
		<pubDate>Tue, 03 Nov 2009 23:19:08 +0000</pubDate>
		<guid isPermaLink="false">http://unclutterer.com/?p=3125#comment-45060</guid>
		<description>[...] OCRed address, meant to be “Aughton” but transcribed as “Argleton”. We already know that Google is OCRing PDFs as it crawls them; or maybe it was OCRed before being uploaded to the web. No [...]</description>
		<content:encoded><![CDATA[<p>[...] OCRed address, meant to be “Aughton” but transcribed as “Argleton”. We already know that Google is OCRing PDFs as it crawls them; or maybe it was OCRed before being uploaded to the web. No [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Steve Hannah</title>
		<link>http://unclutterer.com/2008/11/08/google-can-now-ocr-all-pdfs/comment-page-1/#comment-44904</link>
		<dc:creator>Steve Hannah</dc:creator>
		<pubDate>Sun, 01 Nov 2009 21:35:28 +0000</pubDate>
		<guid isPermaLink="false">http://unclutterer.com/?p=3125#comment-44904</guid>
		<description>Cool tip. If you&#039;re using Mac OS X, you can also use PDF OCR X to convert your PDFs to text. It uses Google&#039;s Tesseract OCR engine for its conversions, so it should produce more or less the same results as Google.
http://solutions.weblite.ca/pdfocrx</description>
		<content:encoded><![CDATA[<p>Cool tip. If you&#8217;re using Mac OS X, you can also use PDF OCR X to convert your PDFs to text. It uses Google&#8217;s Tesseract OCR engine for its conversions, so it should produce more or less the same results as Google.<br />
<a href="http://solutions.weblite.ca/pdfocrx" rel="nofollow">http://solutions.weblite.ca/pdfocrx</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Molly</title>
		<link>http://unclutterer.com/2008/11/08/google-can-now-ocr-all-pdfs/comment-page-1/#comment-23599</link>
		<dc:creator>Molly</dc:creator>
		<pubDate>Wed, 12 Nov 2008 16:29:37 +0000</pubDate>
		<guid isPermaLink="false">http://unclutterer.com/?p=3125#comment-23599</guid>
		<description>Ooh, I just tripped over another way! It may not be great for searching, or for transforming entire documents, but if you upload clear PDFs to Google Docs, selected text can be copied and pasted into another document. This&#039;ll be dead useful for me writing research papers, and probably save a lot of retyping time for others.</description>
		<content:encoded><![CDATA[<p>Ooh, I just tripped over another way! It may not be great for searching, or for transforming entire documents, but if you upload clear PDFs to Google Docs, selected text can be copied and pasted into another document. This&#8217;ll be dead useful for me writing research papers, and probably save a lot of retyping time for others.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: DanGTD</title>
		<link>http://unclutterer.com/2008/11/08/google-can-now-ocr-all-pdfs/comment-page-1/#comment-23574</link>
		<dc:creator>DanGTD</dc:creator>
		<pubDate>Wed, 12 Nov 2008 13:00:44 +0000</pubDate>
		<guid isPermaLink="false">http://unclutterer.com/?p=3125#comment-23574</guid>
		<description>Nice tool, but having to wait for the spiders to crawl your files takes time. I wish Scribd would add this and convert documents to text in real time.</description>
		<content:encoded><![CDATA[<p>Nice tool, but having to wait for the spiders to crawl your files takes time. I wish Scribd would add this and convert documents to text in real time.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: jon</title>
		<link>http://unclutterer.com/2008/11/08/google-can-now-ocr-all-pdfs/comment-page-1/#comment-23441</link>
		<dc:creator>jon</dc:creator>
		<pubDate>Sun, 09 Nov 2008 11:17:48 +0000</pubDate>
		<guid isPermaLink="false">http://unclutterer.com/?p=3125#comment-23441</guid>
		<description>Sammy, it does not go without saying, and I think the original article should be clearer. If you scan personal documents and store them online so that you can access them easily, you are also making them easy for anybody else to access, unless you take steps to protect them. The warning should come before the instructions, and it should be much more explicit and bolder.</description>
		<content:encoded><![CDATA[<p>Sammy, it does not go without saying, and I think the original article should be clearer. If you scan personal documents and store them online so that you can access them easily, you are also making them easy for anybody else to access, unless you take steps to protect them. The warning should come before the instructions, and it should be much more explicit and bolder.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jack</title>
		<link>http://unclutterer.com/2008/11/08/google-can-now-ocr-all-pdfs/comment-page-1/#comment-23440</link>
		<dc:creator>Jack</dc:creator>
		<pubDate>Sun, 09 Nov 2008 07:06:48 +0000</pubDate>
		<guid isPermaLink="false">http://unclutterer.com/?p=3125#comment-23440</guid>
		<description>Luke, you might want to try Evernote. They do OCR for images, are private and can be kept entirely on your computer if you want, and have the best recognition I&#039;ve come across. It&#039;s where I save my product receipts, for example.</description>
		<content:encoded><![CDATA[<p>Luke, you might want to try Evernote. They do OCR for images, are private and can be kept entirely on your computer if you want, and have the best recognition I&#8217;ve come across. It&#8217;s where I save my product receipts, for example.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Fit Bottomed Girls</title>
		<link>http://unclutterer.com/2008/11/08/google-can-now-ocr-all-pdfs/comment-page-1/#comment-23429</link>
		<dc:creator>Fit Bottomed Girls</dc:creator>
		<pubDate>Sat, 08 Nov 2008 23:34:06 +0000</pubDate>
		<guid isPermaLink="false">http://unclutterer.com/?p=3125#comment-23429</guid>
		<description>I heart Google. They&#039;re amazing.</description>
		<content:encoded><![CDATA[<p>I heart Google. They&#8217;re amazing.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Zora</title>
		<link>http://unclutterer.com/2008/11/08/google-can-now-ocr-all-pdfs/comment-page-1/#comment-23427</link>
		<dc:creator>Zora</dc:creator>
		<pubDate>Sat, 08 Nov 2008 20:08:44 +0000</pubDate>
		<guid isPermaLink="false">http://unclutterer.com/?p=3125#comment-23427</guid>
		<description>I&#039;ve spent 5 years volunteering at Distributed Proofreaders, where we *correct* OCR of images in order to make free ebooks. Uncorrected OCR can be grossly inaccurate. If it&#039;s based on plain text, recent edition, high-quality scan, crisp print, it will probably turn out OK. If it&#039;s in any way unusual, oddly formatted, old, or low-quality, you&#039;ll get gibberish. In either case, OCR may give you what we at DP call &quot;scannos&quot;: &quot;be&quot; for &quot;he&quot;, &quot;arid&quot; for &quot;and&quot;, &quot;clown&quot; for &quot;down&quot;, etc. Until we have strong AI, only the alert human eye will catch the errors.

Be warned.</description>
		<content:encoded><![CDATA[<p>I&#8217;ve spent 5 years volunteering at Distributed Proofreaders, where we *correct* OCR of images in order to make free ebooks. Uncorrected OCR can be grossly inaccurate. If it&#8217;s based on plain text, recent edition, high-quality scan, crisp print, it will probably turn out OK. If it&#8217;s in any way unusual, oddly formatted, old, or low-quality, you&#8217;ll get gibberish. In either case, OCR may give you what we at DP call &#8220;scannos&#8221;: &#8220;be&#8221; for &#8220;he&#8221;, &#8220;arid&#8221; for &#8220;and&#8221;, &#8220;clown&#8221; for &#8220;down&#8221;, etc. Until we have strong AI, only the alert human eye will catch the errors.</p>
<p>Be warned.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Luke Gedeon</title>
		<link>http://unclutterer.com/2008/11/08/google-can-now-ocr-all-pdfs/comment-page-1/#comment-23426</link>
		<dc:creator>Luke Gedeon</dc:creator>
		<pubDate>Sat, 08 Nov 2008 19:32:02 +0000</pubDate>
		<guid isPermaLink="false">http://unclutterer.com/?p=3125#comment-23426</guid>
		<description>Geekmoose,

Does that work for JPGs too? I have gigs of docs scanned to JPG. Does it really OCR, or does it just index the text stored in PDFs? PDFs can store both formatted text and images.

All and Anyone,
Does anyone know of a free option to OCR gigs of JPGs without making them public for Google to process?</description>
		<content:encoded><![CDATA[<p>Geekmoose,</p>
<p>Does that work for JPGs too? I have gigs of docs scanned to JPG. Does it really OCR, or does it just index the text stored in PDFs? PDFs can store both formatted text and images.</p>
<p>All and Anyone,<br />
Does anyone know of a free option to OCR gigs of JPGs without making them public for Google to process?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Tanja</title>
		<link>http://unclutterer.com/2008/11/08/google-can-now-ocr-all-pdfs/comment-page-1/#comment-23425</link>
		<dc:creator>Tanja</dc:creator>
		<pubDate>Sat, 08 Nov 2008 19:28:16 +0000</pubDate>
		<guid isPermaLink="false">http://unclutterer.com/?p=3125#comment-23425</guid>
		<description>Geekmoose: As far as I know, Mac OS X does not do OCR (I tried to make it do so, but my PNGs converted to PDF never got OCR&#039;d). What Mac OS X does do is index PDF files that have the text embedded. For instance, if you print something from, say, Word or a website and select the &#039;Save as PDF&#039; option, it will embed the actual text.
Spotlight can read this, and that makes the PDF searchable. A PDF containing only an image (thus having just one layer) never gets indexed. I&#039;m a bit fuzzy on the details of how PDFs work exactly, but I think they have one or more layers, one of them always being an exact image of your scan / print, and others possibly the exact text and an index and such.

It could be my systems, though it happens on three different computers which makes me think it just works like that. 

Sorry about the long comment :)

Great tip for public info!</description>
		<content:encoded><![CDATA[<p>Geekmoose: As far as I know, Mac OS X does not do OCR (I tried to make it do so, but my PNGs converted to PDF never got OCR&#8217;d). What Mac OS X does do is index PDF files that have the text embedded. For instance, if you print something from, say, Word or a website and select the &#8216;Save as PDF&#8217; option, it will embed the actual text.<br />
Spotlight can read this, and that makes the PDF searchable. A PDF containing only an image (thus having just one layer) never gets indexed. I&#8217;m a bit fuzzy on the details of how PDFs work exactly, but I think they have one or more layers, one of them always being an exact image of your scan / print, and others possibly the exact text and an index and such.</p>
<p>It could be my systems, though it happens on three different computers which makes me think it just works like that. </p>
<p>Sorry about the long comment :)</p>
<p>Great tip for public info!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Geekmoose</title>
		<link>http://unclutterer.com/2008/11/08/google-can-now-ocr-all-pdfs/comment-page-1/#comment-23424</link>
		<dc:creator>Geekmoose</dc:creator>
		<pubDate>Sat, 08 Nov 2008 18:33:10 +0000</pubDate>
		<guid isPermaLink="false">http://unclutterer.com/?p=3125#comment-23424</guid>
		<description>Mac OS X will also OCR PDF files so you can search them using Spotlight. It has done so since 10.3 at least!</description>
		<content:encoded><![CDATA[<p>Mac OS X will also OCR PDF files so you can search them using Spotlight. It has done so since 10.3 at least!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: sammy</title>
		<link>http://unclutterer.com/2008/11/08/google-can-now-ocr-all-pdfs/comment-page-1/#comment-23418</link>
		<dc:creator>sammy</dc:creator>
		<pubDate>Sat, 08 Nov 2008 12:30:28 +0000</pubDate>
		<guid isPermaLink="false">http://unclutterer.com/?p=3125#comment-23418</guid>
		<description>Um... I guess it goes without saying that you shouldn&#039;t do this with documents you want kept private?  For instance, I save my tax returns as PDFs.</description>
		<content:encoded><![CDATA[<p>Um&#8230; I guess it goes without saying that you shouldn&#8217;t do this with documents you want kept private?  For instance, I save my tax returns as PDFs.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
