Google now using OCR to view scanned documents
Posted by Melissa Fillau on 14 Nov 2008 | Tagged as: Search Engine News
Thanks to the latest technological advances, gone are the days when you could post scanned documents on the web and not have them come up in the search results.
Google is now using Optical Character Recognition to view scanned documents. As of the end of October, any scanned document saved in Adobe’s PDF format can be indexed by Google.
Optical character recognition converts pictures of text into words. The only issue is the ability to distinguish letters from numbers and coffee cup stains.
While this is quite a simple task for the human eye, it is not that easy for a computer.
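To get a feel for why a stain can trip up a computer, here is a toy sketch in Python. It is not how Google's OCR works (the post does not say), and real OCR engines are far more sophisticated. The 3x5 bitmaps and the names `TEMPLATES`, `pixel_distance` and `recognize` are all made up for illustration. The idea is simply to compare a scanned character against known shapes and pick the closest one.

```python
# Toy 3x5 bitmaps: '#' is ink, '.' is blank paper.
# These glyph shapes are hypothetical and exist only for this illustration.
TEMPLATES = {
    "I": ("###", ".#.", ".#.", ".#.", "###"),
    "1": (".#.", "##.", ".#.", ".#.", "###"),
}


def pixel_distance(a, b):
    """Count the pixels at which two equally sized bitmaps differ."""
    return sum(
        pa != pb
        for row_a, row_b in zip(a, b)
        for pa, pb in zip(row_a, row_b)
    )


def recognize(bitmap):
    """Return the template character whose bitmap differs least from the input."""
    return min(TEMPLATES, key=lambda ch: pixel_distance(bitmap, TEMPLATES[ch]))


# A cleanly scanned digit one.
clean_one = (".#.", "##.", ".#.", ".#.", "###")

# The same digit with two stray dark pixels (a "stain") on the top row.
stained_one = ("###", "##.", ".#.", ".#.", "###")

print(recognize(clean_one))    # prints: 1
print(recognize(stained_one))  # prints: I
```

Two stray pixels are enough to turn the number 1 into the letter I in this tiny example, which is the kind of mistake a human reader would rarely make.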
Despite the possible inaccuracies, which I’m confident will be few and far between, this will have an enormous impact on the academic world. No more unnecessary trips to the university library when you could be sitting comfortably in front of your computer at home.
Google included a few search results ([repairing aluminum wiring], [spin lock performance], [Mumps and Severe Neutropenia], [Steady success in a volatile world]), which highlight just how useful this new technology actually is.
This is a huge step forward for Google, as all image-based PDF files previously uploaded to the web can now be indexed. As Google so aptly put it: someone, somewhere thought these documents were valuable enough to share with the world. Now they will be!





November 21st, 2008 at 8:54 am
It is fantastic to see that Google is finally giving the contents of PDF docs the time of day, as they usually share a lot of information, whether it be laws, university course requirements, entry forms, etc. Lately, I have been finding the answers to many of my searches in PDF format, as PDFs are usually more factual and informative than someone’s opinion posted on a website or blog.