OCR


What is OCR?

Optical character recognition (OCR) is the mechanical or electronic conversion of handwritten, typewritten, or printed text into machine-encoded text. OCR makes it possible to convert many types of documents into an editable, searchable form. Modern OCR technologies recognize scanned typewritten documents with an accuracy of up to 99%.
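To make the accuracy figure concrete, the following minimal sketch shows one common way such a percentage can be computed: compare the OCR output with a manually verified reference text and count character-level errors with the edit (Levenshtein) distance. The function names edit_distance and character_accuracy are illustrative, and the source does not state which metric stands behind the 99% figure, so treat the choice of metric as an assumption.

```python
def edit_distance(reference: str, recognized: str) -> int:
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    previous = list(range(len(recognized) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, rec_char in enumerate(recognized, start=1):
            cost = 0 if ref_char == rec_char else 1
            current.append(min(
                previous[j] + 1,         # character missing from the OCR output
                current[j - 1] + 1,      # extra character in the OCR output
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, recognized: str) -> float:
    """Share of reference characters recovered correctly, clamped to [0, 1]."""
    if not reference:
        return 1.0 if not recognized else 0.0
    errors = edit_distance(reference, recognized)
    return max(0.0, 1.0 - errors / len(reference))


if __name__ == "__main__":
    reference = "a" * 100
    recognized = "a" * 99 + "b"  # one misread character out of 100
    print(f"{character_accuracy(reference, recognized):.1%}")  # prints 99.0%
```

Under this metric, 99% accuracy still means roughly one wrong character per hundred, which on a dense page can add up to dozens of corrections.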

OCR systems

OCR systems are now widely used. Applications include document recognition, reading the identification numbers of vehicles and railcars in automatic control and monitoring systems, and recognizing text in the graphic content of websites. Both commercial and open source OCR solutions are available.

OpenOCR is the first Russian open source OCR system. It extends the Cognitive CuneiForm recognition core from Cognitive Technologies, which is based on scientific research by the Institute for Systems Analysis RAS. The self-learning algorithms (called adaptive recognition) that Cognitive Technologies designed for the CuneiForm OCR system combine two types of typewritten character recognition algorithms: multifont and omnifont. The system trains itself on the well-printed characters of a document and then uses the font learned from them to recognize the poorly printed characters. Thus, if traditional methods make more mistakes on a low-quality document than a threshold level allows, the final recognition is performed with the adaptive algorithms.

OpenOCR is positioned as an OCR system that preserves the appearance of the recognized document. Such systems not only recognize the text but also determine typographic parameters (font size, font type, typeface) and keep the original position of text and graphics, the page layout, indents, and so on. Typical examples of such systems are OCR systems for home use.
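The adaptive recognition fallback described above can be sketched roughly as follows. This is not the CuneiForm implementation: the glyph representation (a flattened binary bitmap), the base_recognizer callback, the confidence cut-off, and the use of low-confidence results as a proxy for mistakes are all assumptions made for illustration. The first pass plays the role of the traditional method; if too many characters are doubtful, the confidently recognized glyphs of the same document serve as a learned font for re-reading the doubtful ones.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Glyph = Tuple[int, ...]   # assumed: flattened binary bitmap of one character image
Guess = Tuple[str, float]  # (character, confidence in [0, 1])


def similarity(a: Glyph, b: Glyph) -> float:
    """Share of pixels that agree; assumes both glyphs have the same size."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


def adaptive_recognize(
    glyphs: Sequence[Glyph],
    base_recognizer: Callable[[Glyph], Guess],  # hypothetical first-pass recognizer
    confident: float = 0.9,                     # assumed confidence cut-off
    max_doubtful_share: float = 0.2,            # assumed threshold level
) -> str:
    # First pass: the traditional (font-independent) recognizer.
    first_pass = [base_recognizer(g) for g in glyphs]
    result = [char for char, _ in first_pass]
    doubtful = [i for i, (_, conf) in enumerate(first_pass) if conf < confident]

    # Below the threshold the first pass is accepted as the final result.
    if not glyphs or len(doubtful) / len(glyphs) <= max_doubtful_share:
        return "".join(result)

    # Learn a document-specific font from the well-printed characters.
    font: Dict[str, List[Glyph]] = {}
    for glyph, (char, conf) in zip(glyphs, first_pass):
        if conf >= confident:
            font.setdefault(char, []).append(glyph)
    if not font:
        return "".join(result)

    # Second pass: match each doubtful glyph against the learned font.
    for i in doubtful:
        result[i] = max(
            font,
            key=lambda char: max(similarity(glyphs[i], s) for s in font[char]),
        )
    return "".join(result)
```

For example, if a document contains clean images of "A" and "B" and one smudged "A" that the base recognizer cannot read, the second pass assigns the smudged glyph to "A" because it is closest to the clean "A" sample. A real system would also reject matches that are too weak; the sketch always picks the nearest sample to stay short.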

Any OCR system is based on three fundamental principles: sustainability, purposefulness, and adaptability. Under the sustainability principle (also rendered as integrity), the OCR system perceives the object being recognized as a whole, that is, as a set of significant elements related to one another. The purposefulness principle lets the system put forward a hypothesis about which character a set of elements represents and then confirm or reject that hypothesis against the initial data. Finally, under the adaptability principle, every successful recognition is entered into a database and used in further work. In other words, adaptability allows OCR systems to learn by themselves.
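A toy sketch can show how the three principles fit together. Everything here is hypothetical: the element names, the STRUCTURE table, and the Recognizer class are invented for illustration and are far simpler than a real recognition core. A character is described as a set of related elements, the recognizer proposes hypotheses and checks each one against the observed elements, and every confirmed result is stored for reuse.

```python
from typing import Dict, FrozenSet, List, Optional

Elements = FrozenSet[str]

# Hypothetical structural descriptions: each character as a set of elements.
STRUCTURE: Dict[str, Elements] = {
    "B": frozenset({"vertical_stroke", "upper_bowl", "lower_bowl"}),
    "O": frozenset({"closed_loop"}),
    "P": frozenset({"vertical_stroke", "upper_bowl"}),
}


class Recognizer:
    def __init__(self) -> None:
        # Adaptability: confirmed recognitions are kept and reused later.
        self.learned: Dict[Elements, str] = {}

    def hypotheses(self, observed: Elements) -> List[str]:
        """Purposefulness, step 1: candidates ordered by shared elements."""
        candidates = [c for c in STRUCTURE if STRUCTURE[c] & observed]
        return sorted(candidates, key=lambda c: (-len(STRUCTURE[c] & observed), c))

    def verify(self, char: str, observed: Elements) -> bool:
        """Purposefulness, step 2: every expected element must be present."""
        return STRUCTURE[char] <= observed

    def recognize(self, observed: Elements) -> Optional[str]:
        if observed in self.learned:          # reuse an earlier success
            return self.learned[observed]
        for char in self.hypotheses(observed):
            if self.verify(char, observed):
                self.learned[observed] = char  # enter the success into the database
                return char
        return None                            # no hypothesis was confirmed


if __name__ == "__main__":
    recognizer = Recognizer()
    print(recognizer.recognize(frozenset({"vertical_stroke", "upper_bowl"})))
```

The dictionary of learned results is only a stand-in for the database mentioned above; in this sketch it merely avoids repeating the hypothesis search for element sets that were already recognized.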