OCR (Optical Character Recognition) – A Technology to Extract Text from an Image

Optical Character Recognition (OCR) is a technology that identifies and extracts text from an image and converts it into a document file. With it, you can search, copy, and paste text from a scanned document.

It is a very useful tool for office work and for digitizing books and old documents.

Suppose you have a large printed document of 20-50 pages and you want to edit it, but you don’t have the original soft copy (for example, a Word doc file). You would have to type the whole document into your PC before you could edit it, which wastes a lot of time on needless typing.

With OCR software, you can simply scan the document and convert the text in the scanned files into an editable doc file. Many people still don’t know about this technology and waste a lot of their time. Here are some tips for using it in your own documentation work.

Online OCR –  

1. Google Docs – many people don’t know that Google Docs can convert a scanned image into an OCR-processed document.

Just upload a scanned document and convert it to Google Docs format at the time of uploading. You can then select the text inside the Google Doc online, or download the file again in .doc or .txt format.

2. There are many other online OCR websites, but all of them have some limitations, e.g. http://www.free-online-ocr.com/ , http://www.ocronline.com/ , http://www.sciweavers.org/free-online-ocr etc.

Offline OCR –

Many offline software packages are available, but they are paid, e.g. Adobe Acrobat Professional and PhantomPDF by Foxit. These packages have built-in font libraries and work best with English dictionaries and popular English fonts such as Arial and Times New Roman.

When using this software, it is better to scan your documents at a good resolution: 300 dpi or higher. Sometimes it is better to scan an image as JPEG or TIFF instead of PDF.
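To get a feel for what 300 dpi means in practice, here is a small Python sketch that converts a page size into pixel dimensions and estimates the resolution of an existing scan from its pixel width. The function names and the A4 page size (210 x 297 mm) are assumptions chosen for illustration; they do not come from any particular OCR package.

```python
MM_PER_INCH = 25.4

def pixels_at_dpi(width_mm, height_mm, dpi=300):
    """Return (width, height) in pixels for a page scanned at the given dpi."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    return width_px, height_px

def effective_dpi(width_px, width_mm):
    """Estimate the scan resolution from the image width and the paper width."""
    return round(width_px / (width_mm / MM_PER_INCH))

# An A4 page (assumed example) scanned at 300 dpi.
print(pixels_at_dpi(210, 297))

# An A4 scan that is only 1240 pixels wide is well below 300 dpi.
print(effective_dpi(1240, 210))
```

If the estimated value comes out much lower than 300, rescanning at a higher setting will probably give the OCR software more detail to work with.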

Some free software is also available for turning a scanned document into an OCR document –

A.     FreeOCR

This uses the open source Tesseract OCR engine. Tesseract was originally developed by HP and is currently sponsored by Google.

B.    gImageReader

gImageReader is one of the front-ends to the free Tesseract OCR engine. You need to download and install Tesseract separately from this page. The Tesseract engine uses OpenOffice dictionaries and spellcheckers, which can be downloaded from here.

C.    SimpleOCR

SimpleOCR uses its own OCR engine that is capable of learning the fonts in a particular document.
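The Tesseract engine behind FreeOCR and gImageReader can also be called directly from its command-line tool. The Python sketch below is one possible way to do that, assuming Tesseract is already installed and on your PATH. The file names `scan.tif` and `scan` are made-up examples, the helper names are my own, and the `--dpi` option is assumed to be available (it is in Tesseract 4 and later).

```python
import shutil
import subprocess

def build_tesseract_command(image_path, output_base, lang="eng", dpi=300):
    """Build the argument list for the tesseract command-line tool."""
    # Usage pattern: tesseract <image> <output base> -l <language> --dpi <N>
    return ["tesseract", image_path, output_base, "-l", lang, "--dpi", str(dpi)]

def run_ocr(image_path, output_base, lang="eng", dpi=300):
    """Run Tesseract on one image and return the path of the text file it writes."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not installed or not on PATH")
    subprocess.run(
        build_tesseract_command(image_path, output_base, lang, dpi),
        check=True,
    )
    # By default Tesseract writes plain text to <output base>.txt
    return output_base + ".txt"

# Show the command that would be run for a hypothetical scan file.
print(" ".join(build_tesseract_command("scan.tif", "scan")))
```

Calling `run_ocr("scan.tif", "scan")` on a real scanned image should leave the recognized text in `scan.txt`, which you can then open and edit like any other text file.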

Using Scanners –

Some new high-end scanners have this OCR function built in; at scan time you can select the option to scan as an OCR document. On HP printers it generally appears as an output file format, namely PDF/A. Select this file option and your scanned file will automatically be OCR compatible.

Resources –  http://www.freewaregenius.com/2011/11/01/how-to-extract-text-from-images-a-comparison-of-free-ocr-tools/

http://en.wikipedia.org/wiki/Optical_character_recognition

 
