OCR (optical character recognition), in simple terms, is the process of converting text within a digital image, acquired by scanning or another method of digital image capture, into machine-encoded text. In essence, OCR makes it possible to reuse information that was once printed on paper or placed on an object as text. Even in today's digital era, with electronic document exchange on the rise, paper documents still play a major role in document exchange between organizations, and in many countries they must be stored for checks and audits by government institutions. A paper copy of a document is useful to have, but fast exchange of information and access to it within an organization's information systems happens mostly in digital, machine-readable form. OCR therefore plays a vital role in enabling organizations to grow and operate effectively. OCR software has progressed rapidly, with developers around the world building form-recognition systems. From recognizing simple fonts and symbols, OCR engines have advanced to recognizing typewritten text in the world's major languages, which has brought the technology to an entirely different level.
Reading a printed document and understanding what is written there looks easy to us. Reaching that level of “understanding” was much harder for a machine when OCR systems were taking their first steps.
Today, however, advanced technologies are combined to achieve the highest recognition quality and speed. Adaptive binarization algorithms prepare text fields for recognition by removing stains, backgrounds, graphic noise and other objects that may affect recognition quality. Connected vocabularies and the internal databases of a company's information systems are used to check the recognized content and make sure that the symbols, their combinations, and the resulting phrases or sentences are correct and make sense. OCR can now also restore the original text when parts of it are covered by overlapping elements that prevent us from seeing clearly what is written. Multi-core architectures give OCR engines a great advantage in speed.
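To make two of these steps concrete, here is a minimal sketch in Python. It is not taken from any particular OCR engine. It assumes a grayscale image stored as a list of rows with values from 0 to 255, uses a simple local-mean threshold as one possible form of adaptive binarization, and uses the standard library's difflib for the vocabulary check. The function names, the window size, the offset and the cutoff are all illustrative assumptions.

```python
from difflib import get_close_matches


def adaptive_binarize(gray, window=3, offset=10):
    """Binarize a grayscale image (list of rows, values 0-255).

    A pixel is marked as ink (1) when it is darker than the mean of its
    local window by more than `offset`; otherwise it is background (0).
    """
    height, width = len(gray), len(gray[0])
    half = window // 2
    result = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Collect the neighbourhood, clipped at the image borders.
            rows = range(max(0, y - half), min(height, y + half + 1))
            cols = range(max(0, x - half), min(width, x + half + 1))
            neighbours = [gray[r][c] for r in rows for c in cols]
            local_mean = sum(neighbours) / len(neighbours)
            if gray[y][x] < local_mean - offset:
                result[y][x] = 1
    return result


def correct_word(word, vocabulary, cutoff=0.7):
    """Return the closest vocabulary entry, or the word itself if none is close."""
    matches = get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word
```

Because the threshold follows the local brightness, a dark stroke is found both on a bright page and on a dim, shadowed one, where a single global threshold would fail. The vocabulary check replaces a misread word such as "inv0ice" with "invoice" when that word is in the supplied list, and leaves unknown words untouched.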
So, with all these qualities in the picture, how much do you think an OCR system or engine costs? The answer is that OCR systems range from freeware to solutions that cost a fortune to implement in a large enterprise. The OCR engine has become a core element, the foundation of a much larger business sphere that provides solutions based on OCR technology. The tasks OCR can take on are limited only by your imagination and, of course, by the financial sense of developing a custom solution or system for an organization. The area of use of OCR systems broadens with every massive capture project, and so does the technological sophistication of OCR engines. In any case, the market offers various OCR-based software systems tailored to specific operations within an organization, such as invoice capture.
For a home scanner user, we would recommend a software application called Compressimo, which offers the basic functions of scanning, image processing, OCR and saving the result as a PDF file. It is a very good, easy-to-use example of how great technologies can be shaped into an application that anyone can use in three simple steps.
There are also other software products, such as Cognitive Forms, which not only handle invoice capture but also open up a new world of opportunities for any organization dealing with a large paper document flow to optimize its performance and increase the quality of its work and its productivity.
And we will wait for the market to show us more interesting projects and features that OCR technology has in store now or will have in the future.
What is OCR?
Optical character recognition, or OCR, is the mechanical or digital conversion of handwritten, typewritten or printed text into machine-encoded text. OCR technology converts various types of documents into an editable, searchable form. Modern OCR technologies achieve recognition accuracy of up to 99% on scanned typewritten documents.
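The text does not say how a figure such as 99% is measured. One common convention, assumed here, is character accuracy: the recognized text is compared with a hand-checked reference, and the number of character errors (the edit distance) is divided by the length of the reference. The sketch below illustrates that convention; the function names are illustrative.

```python
def edit_distance(reference, recognized):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(recognized) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, rec_char in enumerate(recognized, start=1):
            cost = 0 if ref_char == rec_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def character_accuracy(reference, recognized):
    """Share of reference characters recovered correctly, between 0 and 1."""
    if not reference:
        return 1.0 if not recognized else 0.0
    errors = edit_distance(reference, recognized)
    return max(0.0, 1 - errors / len(reference))
```

Under this measure, one wrong character in a 100-character line gives an accuracy of 0.99, so even 99% still means roughly one error in every line or two of typical text.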
Nowadays, OCR systems are widely used. Applications include document recognition, recognition of vehicle and railcar identification numbers in automatic control and monitoring systems, recognition of graphic content on websites, and so on. There are both commercial and open source OCR solutions. OpenOCR is the first Russian open source OCR system. It is an extension of the Cognitive CuneiForm recognition core from Cognitive Technologies, which is based on scientific research by the Institute for Systems Analysis RAS. The self-learning algorithms (called adaptive recognition) designed by Cognitive Technologies for the CuneiForm OCR system combine two types of typewritten character recognition algorithms: multifont and omnifont. The system teaches itself the font of a document from its well-printed characters and then uses that knowledge to recognize the poorly printed ones. Thus, if the OCR system makes more mistakes than a threshold level allows when recognizing a low-quality document with traditional methods, the final recognition is done with the adaptive algorithms. OpenOCR is positioned as an OCR system that preserves the appearance of the recognized document. Such systems not only recognize documents but also determine various typographic parameters (font size and type, typeface) and keep the original position of text and graphics, page layout, indents, etc. Typical examples of such systems are OCR systems for home use.
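The threshold-based fallback to adaptive recognition can be sketched roughly as follows. This is a simplified illustration, not the actual CuneiForm algorithm. It assumes that a general first pass has already produced a label and a confidence for each glyph, that glyphs are equal-length tuples of 0/1 pixels, and that the share of low-confidence glyphs stands in for the error rate. All names and threshold values are assumptions made for the example.

```python
def hamming(a, b):
    """Number of positions where two equal-length glyphs differ."""
    return sum(x != y for x, y in zip(a, b))


def adaptive_pass(first_pass, confident=0.9, max_doubtful_share=0.2):
    """Refine first-pass results using the document's own well-printed glyphs.

    first_pass: list of (glyph, label, confidence) tuples from a general
    classifier. Returns the final list of labels.
    """
    labels = [label for _, label, _ in first_pass]
    doubtful = [i for i, (_, _, conf) in enumerate(first_pass) if conf < confident]

    # Few doubtful glyphs: the general result is accepted as it is.
    if len(doubtful) <= max_doubtful_share * len(first_pass):
        return labels

    # Learn the font from confidently recognized (well-printed) glyphs.
    templates = [(glyph, label) for glyph, label, conf in first_pass
                 if conf >= confident]
    if not templates:
        return labels

    # Re-label each doubtful glyph with the label of its nearest template.
    for i in doubtful:
        glyph = first_pass[i][0]
        _, best_label = min(templates, key=lambda t: hamming(glyph, t[0]))
        labels[i] = best_label
    return labels
```

In this sketch a smudged "I" that the first pass misread as "L" is corrected because, pixel for pixel, it is closest to a clean "I" found elsewhere in the same document.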
Any OCR system is based on three fundamental principles: integrity, purposefulness and adaptability. Under the principle of integrity, the OCR system perceives the object being recognized as a set of significant, interrelated elements. The principle of purposefulness lets the OCR system put forward a hypothesis about which character a given set of elements represents and then confirm or reject that hypothesis against the initial data. Finally, adaptability means that every successful recognition is entered into a database and used in further work. In other words, adaptability allows OCR systems to teach themselves.
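As a toy illustration of how the three principles fit together, consider the sketch below. It is an assumption-laden model, not how any real engine is built. A character is described as a set of named structural elements (the first principle), hypotheses are ranked and then verified against the observed elements (purposefulness), and confirmed results are stored for reuse (adaptability). The class name, the element names and the prototypes are all hypothetical.

```python
class TinyRecognizer:
    def __init__(self, prototypes):
        # Each character is modelled as a set of related structural elements.
        self.prototypes = {label: frozenset(parts)
                           for label, parts in prototypes.items()}
        # Adaptability: confirmed recognitions are remembered here.
        self.memory = {}

    def recognize(self, elements):
        elements = frozenset(elements)
        if elements in self.memory:
            return self.memory[elements]

        # Purposefulness: rank hypotheses by how many elements they share
        # with the observation, most promising first.
        ranked = sorted(self.prototypes,
                        key=lambda label: len(self.prototypes[label] & elements),
                        reverse=True)
        for label in ranked:
            # Verify the hypothesis: every element the character requires
            # must actually be present in the observation.
            if self.prototypes[label] <= elements:
                self.memory[elements] = label
                return label
        return None  # no hypothesis could be confirmed


recognizer = TinyRecognizer({
    "L": {"vertical", "bottom_bar"},
    "E": {"vertical", "bottom_bar", "top_bar", "middle_bar"},
    "F": {"vertical", "top_bar", "middle_bar"},
})
```

Given a vertical stroke, a top bar, a middle bar and a stray dot of noise, this recognizer first tries "E", rejects it because the bottom bar is missing, confirms "F", and remembers that exact combination for next time.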