Our glossary contains explanations of the most important technical terms from the areas of Character Recognition, Forms Processing and Data Capture. This collection does not claim to be complete or absolutely correct. Its purpose is rather to give beginners an idea of what is meant by these terms.
An archiving system is a computer program for controlled filing of digital documents. Archiving systems provide means for a fast search of the archived documents for keywords entered when the documents were filed. Archiving systems with integrated OCR are even able to automatically index scanned documents or image files and provide a full text search.
A barcode is a machine-readable sequence of vertical bars of varying width and spacing that encodes a number or a short alphanumeric text. Barcodes are printed on packages or documents for automatic identification and can be read with special barcode scanners or document scanners.
A character is a standardized code for storing and displaying a letter, a digit or a special symbol on a computer. The set of all allowed characters is also called a character set.
A character set is the set of all characters that can be displayed on a computer system. Depending on the application, there are 8-bit character sets with 256 characters (e.g. ANSI or extended ASCII) and 16-bit character sets with many thousands of characters (Unicode). For 8-bit character sets there are many national variants, so-called code pages, for displaying language-specific characters.
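The effect of code pages can be illustrated with a short Python sketch. It decodes the same 8-bit value under two code pages; the helper name `decode_byte` is an assumption made for this example, while `cp1252` and `cp437` are codec names shipped with Python.

```python
def decode_byte(value, code_page):
    """Interpret a single 8-bit value using the given code page."""
    return bytes([value]).decode(code_page)

# The same stored byte is displayed as a different character
# depending on which code page is used to interpret it.
for code_page in ("cp1252", "cp437"):
    print(code_page, decode_byte(0xE4, code_page))
```

This is why text stored in an 8-bit character set can only be displayed correctly if the code page it was written with is known.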
The color depth of a raster image determines the maximum number of different values for each image point. Usually the color depth is 1 bit (black and white), 8 bits (256 shades of gray) or 24 bits (color). For OCR purposes, black-and-white images are fully sufficient, as OCR engines cannot use grayscale or color information. For converting documents with embedded images, most modern programs also support scanning in color mode.
Comb fields are input fields on manually filled-out forms that are divided by vertical bars into boxes for the individual characters. Comb fields, especially in combination with drop-out colors, achieve a significantly higher accuracy rate because separating the characters becomes much easier.
Data capture is the collective term for entering data into a computer. One can distinguish between manual and automatic data capture, where automatic data capture aims to replace manual data entry completely or at least partially. Examples of systems for automatic data capture are scanners and OCR software.
Dictionary lookup is a processing step of OCR for the automatic verification or correction of text analysis results. The first OCR programs with dictionary lookup used it only for verification, whereas modern OCR programs increasingly integrate their dictionaries into the recognition process and use them as a decision guide to resolve ambiguities.
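A minimal sketch of dictionary-based correction, assuming a small word list and Python's standard `difflib` module for fuzzy matching. The function name `lookup`, the word list and the cutoff value are assumptions made for this illustration, not the method of any particular OCR product.

```python
import difflib

def lookup(word, dictionary, cutoff=0.8):
    """Verify a recognized word, or replace it with the closest dictionary entry."""
    if word in dictionary:
        return word  # verification: the word is known
    # Correction: pick the most similar entry, if any is similar enough.
    matches = difflib.get_close_matches(word, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else word

words = ["recognition", "resolution", "scanner"]  # hypothetical dictionary
print(lookup("rec0gnition", words))  # the digit 0 was read instead of the letter o
```

Words with no sufficiently similar entry are returned unchanged, so names or numbers that are not in the dictionary are not forced into a wrong correction.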
Document recognition is the process of converting a document depicted in a raster image into an editable form. For document recognition, printed documents are usually scanned and converted with OCR software into a text file (e.g. DOC, HTML, PDF, TXT) for further editing in a word processor.
Document scanners are peripheral devices for PCs for digitizing documents quickly. In contrast to flatbed scanners, document scanners are always equipped with an automatic document feeder (ADF) that accepts batch sizes between 50 and 1000 pages. Features optionally provided by document scanners include drop-out color, duplex capability and endorsers.
The most popular manufacturers of document scanners are Bell+Howell, Canon, Fujitsu, Kodak and Xerox. Links to all scanner manufacturers can be found under Scanner Links.
A drop-out color is one of the primary colors (red, green or blue) that is deliberately suppressed during scanning. By using a drop-out color, pre-printed lines or frames on forms or questionnaires can be filtered out. This allows better compression of the images and also improves recognition with OCR or ICR.
To apply a drop-out color, a grayscale scanner can be equipped with a special lamp, or a color scanner can be used. Modern document scanners are all able to activate a drop-out color.
Feature extraction is an OCR method for classifying characters that is used by most OCR programs today. With feature extraction, each character is broken down into geometric elements such as lines, arcs and circles, and the combination of these elements is compared with stored combinations for known characters. This method is much more flexible than the formerly used pattern recognition and also copes well with variations in font style and size.
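The comparison step can be sketched as a toy example in Python. It assumes the geometric elements have already been extracted from the character image; the feature table `KNOWN` and the function `classify` are hypothetical and far simpler than what real OCR programs use, which also take the position and size of the elements into account.

```python
from collections import Counter

# Hypothetical stored combinations of geometric elements for known characters.
KNOWN = {
    "L": {"line": 2},
    "D": {"line": 1, "arc": 1},
    "O": {"circle": 1},
    "b": {"line": 1, "circle": 1},
}

def classify(elements):
    """Return the known character whose element combination is closest."""
    counts = Counter(elements)

    def difference(name):
        stored = KNOWN[name]
        kinds = set(counts) | set(stored)
        return sum(abs(counts[kind] - stored.get(kind, 0)) for kind in kinds)

    return min(KNOWN, key=difference)

print(classify(["line", "arc"]))
```

Because only the kind and number of elements are compared, the result does not depend on the size of the character or on minor differences in font style.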
Flatbed scanners are peripheral devices for PCs for digitizing documents. In contrast to document scanners, flatbed scanners are usually not equipped with an automatic document feeder (ADF) and accept only one page at a time. Flatbed scanners are therefore not well suited for processing large numbers of pages. They can, however, also be used for scanning books.
The most popular manufacturers of flatbed scanners are Epson and Hewlett-Packard. Links to all scanner manufacturers can be found under Scanner Links.
Forms capture, forms recognition and forms processing are synonymous terms for the automatic capture of the contents of filled-out forms, on the precondition that all pages have exactly the same structure and the items to be read are always placed at the same positions. Depending on the structure of the forms and how they are filled out, the data items are read using OCR, ICR or OMR.
Handwriting recognition (ICR) is a software technology for the automatic recognition of handwritten characters. Depending on the application, one can distinguish between vector-based handwriting recognition, as used with tablet PCs and PDAs, and raster-based handwriting recognition, which is required for recognizing scanned documents. Naturally, the accuracy rate achieved with handwriting recognition (around 95%) is much lower than that of the recognition of printed text (>99%). It is therefore especially important to design forms with comb fields and drop-out colors so that they can be processed optimally.
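What these accuracy rates mean for a whole form field can be estimated with a short calculation. The sketch below assumes that recognition errors on individual characters are independent of each other, which is a simplification; the function name `field_accuracy` and the field length of 10 characters are chosen only for illustration.

```python
def field_accuracy(char_accuracy, length):
    """Probability that every character of a field is read correctly,
    assuming independent errors per character."""
    return char_accuracy ** length

for label, rate in (("handwriting, 95%", 0.95), ("printed text, 99%", 0.99)):
    print(f"{label}: {field_accuracy(rate, 10):.1%} of 10-character fields error-free")
```

Under this assumption, roughly 60% of handwritten 10-character fields would be read without any error, compared with roughly 90% of printed ones, which shows why form design matters so much for ICR.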
An image file is a coded form of a digitized image. Image files can contain either raster images (bitmaps) or vector images. For OCR purposes, only raster images are meaningful. A raster image is a rectangular image with a certain resolution and color depth, which can be stored uncompressed or compressed, depending on the image file format. Image files of documents are most often stored in the TIFF or PDF format, whereas the JPEG format is most commonly used for storing photographs.
ISIS is a driver standard developed by the company Captiva for controlling document scanners under Microsoft Windows. In comparison to TWAIN, the ISIS standard provides complete control over all features of a compatible scanner and drives it at its maximum speed.
Layout analysis is a processing step of OCR that is important when recognizing complex documents with multiple columns, tables or embedded images. During layout analysis the OCR software examines the structure of the document, distinguishes between images and text, and tries to recognize the text flow of the document. Modern OCR software with good layout analysis can reproduce the document structure almost identically to the original and save it in a text file (e.g. DOC, HTML or PDF).
In typography, a ligature occurs where two or more letters are joined into a single symbol, e.g. fi. For OCR software, the difficulty with ligatures is to recognize them as two characters during segmentation.
Multiple choice questions are often used on questionnaires and exams to offer a certain number of answers to a question. Instead of answering the question in their own words, the person filling out the questionnaire can check one or more of the pre-defined answers, which makes automatic evaluation much easier.
OCR (Optical Character Recognition) is the technical term for the automatic recognition of printed characters using optical rasterization (e.g. by scanners or digital cameras). Simply put, OCR tries to have printed text transcribed by a computer.
Lawrence Roberts is said to be the father of OCR; he conducted the first experiments on automatic text recognition at MIT in 1960. The first practical applications of OCR appeared as hardware solutions in 1965. Back then, recognition was limited to specially designed fonts such as OCR-A and OCR-B. In 1976 Ray Kurzweil developed the first omnifont, i.e. font-independent, OCR system. With increasing computer performance, software-based OCR solutions have steadily gained importance since the mid-eighties.
OCR can be divided into the processing steps scanning, layout analysis, segmentation, character recognition and dictionary lookup, although the boundaries between these steps are increasingly blurred in modern systems. Typical applications of OCR are document recognition, archiving systems and forms processing.
OCR software is a class of computer programs that apply OCR. Typical examples of OCR software such as OmniPage offer features for scanning or importing image files, performing an automatic layout analysis and recognizing the contents of documents using high-performance OCR algorithms. The converted documents can be saved in many different formats such as DOC, HTML, PDF or TXT for further editing.
For reading forms or questionnaires with a fixed structure, there is specialized OCR software such as FormPro. Developer tools such as the OmniPage Capture SDK are a special form of OCR software that enables software developers to integrate OCR easily into their own applications.
Omnifont means font-independent. In conjunction with OCR, this means that not only certain pre-defined fonts are recognized but also unknown fonts, by means of flexible recognition algorithms that recognize characters by more general features.
OMR (Optical Mark Recognition) is the technical term for the automatic recognition of check marks using optical rasterization (e.g. by scanners or digital cameras). OMR is most often used on forms, questionnaires or exam sheets to evaluate check box fields or multiple choice questions. As OMR only has to detect whether a certain rectangular area contains a mark or not, it is much more accurate than OCR or ICR.
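The decision whether a rectangular area contains a mark can be sketched in Python as a simple count of dark image points. The image is assumed to be a black-and-white raster given as rows of 0 (white) and 1 (black); the function name `is_marked` and the threshold of 20% are assumptions for this illustration and would need tuning on real forms.

```python
def is_marked(image, box, threshold=0.2):
    """Return True if the share of black points inside box reaches threshold.

    box is (left, top, right, bottom) with right and bottom exclusive.
    """
    left, top, right, bottom = box
    area = (right - left) * (bottom - top)
    dark = sum(image[y][x] for y in range(top, bottom) for x in range(left, right))
    return dark / area >= threshold

scan = [
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 1, 0],
]
print(is_marked(scan, (0, 0, 3, 3)))  # left box contains a cross
print(is_marked(scan, (3, 0, 6, 3)))  # right box contains only a speck
```

The threshold keeps isolated specks of dirt from being counted as marks.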
Pattern recognition is an OCR method for classifying characters which because of its lack of flexibility is rarely used anymore. With pattern recognition the scanned characters will be compared with stored patterns after segmentation and assigned to the pattern they match most closely. This method from the early days of OCR is suited only for the recognition of fixed fonts and is used nowadays only for a few special applications. See also Feature Extraction.
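As a rough Python sketch of this comparison, the example below stores one small bitmap per character and assigns a scanned character to the pattern with the fewest differing image points. The three 3x5 patterns and the names `PATTERNS`, `distance` and `classify` are made up for this illustration.

```python
# Hypothetical stored patterns: '#' is a black point, '.' a white point.
PATTERNS = {
    "I": ("###", ".#.", ".#.", ".#.", "###"),
    "L": ("#..", "#..", "#..", "#..", "###"),
    "T": ("###", ".#.", ".#.", ".#.", ".#."),
}

def distance(glyph, pattern):
    """Count the image points in which two equally sized bitmaps differ."""
    return sum(a != b
               for glyph_row, pattern_row in zip(glyph, pattern)
               for a, b in zip(glyph_row, pattern_row))

def classify(glyph):
    """Assign the glyph to the stored pattern it matches most closely."""
    return min(PATTERNS, key=lambda name: distance(glyph, PATTERNS[name]))

noisy_l = ("#..", "#..", "##.", "#..", "###")  # an L with one stray black point
print(classify(noisy_l))
```

The sketch also shows the lack of flexibility: the comparison only works if the scanned character has the same size and shape as the stored pattern, i.e. for a fixed font.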
Questionnaire processing comprises the automatic capture of the contents of filled-out questionnaires. In contrast to forms, questionnaires are most often designed to be filled out completely or to a large extent using check marks (see Multiple Choice Questions). This enables recognition at high speed and with a low error rate.
The resolution of a scanner is a measure of how densely the image is sampled. It is measured in dpi (dots per inch). For OCR purposes a resolution between 300 and 400 dpi is sufficient, whereas resolutions of 1200 dpi and higher are used for scanning images for reproduction purposes.
Scanning in conjunction with OCR is the process of optical rasterization of a printed document using a scanner. The document is divided into image points, and each point is assigned a value describing black-and-white, grayscale or color information.
The resolution of a scanner is measured in dpi (dots per inch) and describes how densely the document is sampled. For OCR purposes resolutions between 300 and 400 dpi are sufficient, whereas resolutions of 1200 dpi and higher are used for scanning images for reproduction purposes.
The time needed for scanning and the amount of memory required depend not only on the resolution but also on the color depth. The color depth determines the maximum number of different values for each image point. Usually the color depth is 1 bit (black and white), 8 bits (256 shades of gray) or 24 bits (color).
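How resolution and color depth together determine the memory requirement can be shown with a short Python calculation. It is a sketch for an uncompressed raster image that ignores file headers and row padding; the function name `raw_image_bytes` and the page size of 8.5 x 11 inches are assumptions for this example.

```python
def raw_image_bytes(width_in, height_in, dpi, bits_per_point):
    """Uncompressed size in bytes of a scan, ignoring headers and row padding."""
    width_points = round(width_in * dpi)
    height_points = round(height_in * dpi)
    return width_points * height_points * bits_per_point // 8

for bits in (1, 8, 24):
    size = raw_image_bytes(8.5, 11, 300, bits)
    print(f"{bits:2d} bit: {2 ** bits} values per point, {size / 1_000_000:.1f} MB")
```

Doubling the resolution quadruples the number of image points, so both settings should be kept as low as the application allows.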
The result of the scanning process is a raster image that can be opened with image processing software. If the text contained in this image is to be edited with a word processor, an OCR program must first search the image for characters and interpret them.
A scanner is a PC peripheral for digitizing images or documents. Scanners can be categorized into barcode scanners, flatbed scanners and document scanners. Links to all important scanner manufacturers can be found under Scanner Links.
Segmentation is a processing step of OCR. During segmentation the text image is divided into lines, and the lines are divided into characters or ligatures. After that, the characters can be recognized with a suitable OCR method such as pattern recognition or feature extraction.
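One simple way to perform this division is to look for empty rows between lines and empty columns between characters. The Python sketch below uses that approach on a tiny black-and-white image given as strings; it is only one possible technique, and the names `runs` and `segment` are assumptions for this example.

```python
def runs(flags):
    """Return (start, end) pairs for each run of True values; end is exclusive."""
    spans, start = [], None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(flags)))
    return spans

def segment(image):
    """Split an image ('#' = black, '.' = white) into lines of character boxes.

    Each box is (left, top, right, bottom) with right and bottom exclusive.
    """
    lines = []
    for top, bottom in runs(["#" in row for row in image]):
        rows = image[top:bottom]
        column_has_ink = [any(row[x] == "#" for row in rows)
                          for x in range(len(rows[0]))]
        lines.append([(left, top, right, bottom)
                      for left, right in runs(column_has_ink)])
    return lines

page = [
    "........",
    "#.##.#..",
    "#.#..#..",
    "........",
    ".##.....",
    ".##.....",
]
for boxes in segment(page):
    print(boxes)
```

Characters that touch each other, such as ligatures, have no empty column between them and therefore end up in a single box with this approach.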
TWAIN is a driver standard developed by the companies Aldus, Eastman-Kodak, Hewlett-Packard and Logitech for controlling scanners and digital cameras under Microsoft Windows. Because it is so widespread, it is supported by practically all scanners and image processing programs. See also ISIS.