Glossary
Our glossary explains some of the most important technical terms from the areas of Character Recognition, Forms Processing and Data Capture. This collection does not claim to be complete or scientifically correct. Its purpose is rather to give beginners an idea of what is meant by these terms.
Barcode
A barcode is a machine-readable sequence of vertical bars with different distances and widths encoding a number or an alphanumeric piece of text. Barcodes are imprinted on packages or documents for automatic identification and can be read with special barcode or document scanners.
Character
A character is a standardized code for storing and displaying a letter, a digit or a special symbol on a computer. The set of all allowed characters is also called a character set.
Character Recognition
Character recognition is a synonym for the automatic conversion of printed characters into editable text files. See also OCR.
Character Set
A character set is the set of all characters that can be displayed on a computer system. Depending on the application, there are 8-bit character sets with 256 characters (e.g. the ANSI code pages, which extend the 128 characters of 7-bit ASCII) and 16-bit character sets with many thousands of characters (Unicode). For 8-bit character sets there are many national variations, so-called code pages, for displaying language-specific characters.
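As a rough illustration of code pages, the following Python sketch decodes the same 8-bit value with two Windows code pages. The byte value 0xE4 and the two code pages are chosen purely as an example; the variable names are illustrative.

```python
# The same 8-bit value maps to different characters under different code pages.
raw = bytes([0xE4])

western = raw.decode("cp1252")   # Windows Western European code page
cyrillic = raw.decode("cp1251")  # Windows Cyrillic code page

# Print the Unicode code points rather than the characters themselves,
# so the output does not depend on the console's own character set.
print(f"cp1252: U+{ord(western):04X}")   # Latin small letter a with diaeresis
print(f"cp1251: U+{ord(cyrillic):04X}")  # Cyrillic small letter de

# Number of code positions available in 8-bit and 16-bit character sets.
positions_8bit = 2 ** 8
positions_16bit = 2 ** 16
print(positions_8bit, positions_16bit)
```

Both results are different characters even though the stored byte is identical, which is why a text file in an 8-bit character set can only be displayed correctly if its code page is known.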
Color Depth
The color depth of a raster image determines the maximum number of different values for each image point. Usually the color depth is 1 bit (black and white), 8 bits (256 shades of gray) or 24 bits (color). For OCR purposes black-and-white images are generally sufficient, as OCR engines typically convert the image to black and white before recognition. For converting documents with embedded images most modern programs also support scanning in color mode.
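The relation between color depth and the number of possible values per image point is a power of two. A minimal Python sketch (the function name is illustrative only):

```python
def distinct_values(bits_per_pixel: int) -> int:
    """Number of different values one image point can take at a given color depth."""
    return 2 ** bits_per_pixel


for bits in (1, 8, 24):
    print(f"{bits:2d} bit: {distinct_values(bits)} values")
```

For the three common depths this gives 2, 256 and 16,777,216 values per image point.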
Data Capture
Data capture is the collective term for entering data into a computer. One can distinguish between manual and automatic data capture, where automatic data capture aims at replacing manual data entry completely or at least partially. Examples of systems for automatic data capture are scanners and OCR software.
Document Recognition
Document recognition is the process of converting a document represented by a raster image into an editable form. For document recognition, printed documents are usually scanned and converted into a text file (e.g. DOCX, HTML, PDF, TXT) using OCR software for further editing in a word processor.
Drop-out Color
A drop-out color is one of the primary colors (red, green or blue) that is deliberately suppressed during scanning. By using a drop-out color, pre-printed lines or frames on forms or questionnaires can be filtered out. This enables better compression of the images and also improves recognition using OCR or ICR.
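Scanners usually apply the drop-out color in hardware, but the effect can be approximated in software on an RGB image. The sketch below is an assumption-laden illustration, not a scanner's actual method: it looks only at one color channel, where ink of the drop-out color appears almost as bright as the paper, and then converts to black and white. The function name, the sample colors and the threshold of 128 are all hypothetical.

```python
def drop_out(pixels, channel=0, threshold=128):
    """Binarize RGB pixels using only one channel (0 = red, 1 = green, 2 = blue).

    Returns 1 for dark (ink) points and 0 for bright (background) points.
    """
    return [[1 if px[channel] < threshold else 0 for px in row] for row in pixels]


WHITE = (255, 255, 255)  # paper
RED = (220, 30, 30)      # pre-printed form frame
BLACK = (20, 20, 20)     # filled-in text

# One image row: paper, frame, text, frame, paper.
image = [[WHITE, RED, BLACK, RED, WHITE]]

print(drop_out(image, channel=0))  # red as drop-out color: the frame disappears
print(drop_out(image, channel=1))  # green channel: the frame stays visible
```

With red as the drop-out color only the black text remains as ink, whereas in the green channel the red frame is still dark and would reach the recognition step.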
Document Scanner
Document scanners are peripheral devices for PCs for digitizing documents quickly. In comparison to flatbed scanners document scanners are always equipped with an automatic document feeder (ADF) accepting batch sizes between 50 and 1000 pages. Features optionally provided by document scanners include drop-out color, duplex capability and endorsers.
The most popular manufacturers of document scanners are Bell+Howell, Canon, Fujitsu, Kodak and Xerox.
Flatbed Scanner
Flatbed scanners are peripheral devices for PCs for digitizing documents. In comparison to document scanners, flatbed scanners are usually not equipped with an automatic document feeder (ADF) and accept only one page at a time. Flatbed scanners are therefore not well suited for processing large numbers of pages, but they can also be used for scanning books.
The most popular manufacturers of flatbed scanners are Epson and Hewlett-Packard.
Form Capture, Form Processing, Form Recognition
Form capture, form recognition and form processing are synonyms for the automatic capture of the contents of filled-out forms, provided that all pages have exactly the same structure and the items to be read are always placed at the same positions. Depending on the structure of the forms and how they are filled out, the data items are read using OCR, ICR or OMR.
Handwriting Recognition
Handwriting recognition (ICR) is a software technology for the automatic recognition of handwritten characters. Depending on the application one can distinguish between vector-based handwriting recognition as it is used with tablet PCs and PDAs, and raster-based handwriting recognition which is required for recognizing scanned documents.
ICR
ICR (Intelligent Character Recognition) is the technical term for the automatic recognition of handwritten characters. See also Handwriting Recognition.
Image File
An image file is a coded form of a digitized image. Image files can either be raster images (bitmaps) or vector images. For OCR purposes, only raster images are meaningful. A raster image is a rectangular image with a certain resolution and color depth which can be uncompressed or compressed, according to the image file format. When working with documents, image files are most often stored in the TIFF or PDF format; for storing photographs the JPEG format is most commonly used.
ISIS
ISIS is a driver standard developed by the company Captiva for controlling document scanners under Microsoft Windows. In comparison to TWAIN the ISIS standard provides complete control over all features of a compatible scanner and drives it at its maximum speed.
Multiple-choice Question
Multiple-choice questions are often used on questionnaires and exams to offer a certain number of pre-defined answers to a question. Instead of answering the question in their own words, the person filling out the questionnaire checks one or more of the pre-defined answers, which makes the automatic evaluation much easier.
OCR
OCR (Optical Character Recognition) is the technical term for the automatic recognition of printed characters using optical rasterization (e.g. by scanners or digital cameras). Simply speaking, OCR tries to have printed text transcribed by a computer.
OCR Software
OCR software is a class of computer programs applying OCR. Typical examples of OCR software like OmniPage or OmniPage Server offer features for scanning or importing image files, conducting an automatic layout analysis and recognizing the contents of documents by using high-performance OCR algorithms. The documents converted in this way can be saved in many different formats like DOC, HTML, PDF or TXT for further processing.
Omnifont
Omnifont means font-independent. In conjunction with OCR this means that not only certain pre-defined fonts will be recognized but also unknown fonts, using flexible recognition algorithms that recognize characters by more general features.
OMR
OMR (Optical Mark Recognition) is the technical term for the automatic recognition of check marks using optical rasterization (e.g. by scanners or digital cameras).
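One simple way to decide whether a check box has been marked is to measure how many of its image points are dark. The Python sketch below illustrates this idea on a tiny black-and-white image region; it is not the method of any particular OMR product, and the function name and the 20 % fill threshold are assumptions made for the example.

```python
def is_marked(box, min_fill=0.2):
    """Return True if the share of dark points (value 1) in the box reaches min_fill."""
    total = sum(len(row) for row in box)
    dark = sum(sum(row) for row in box)
    return dark / total >= min_fill


# Interior of an empty check box: no dark points.
empty_box = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

# Interior of a box with a cross: 8 of 16 points are dark.
checked_box = [
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
]

print(is_marked(empty_box))    # False
print(is_marked(checked_box))  # True
```

In practice the threshold has to tolerate dirt and stray strokes on the one hand and faint marks on the other, so it would be tuned on real scans.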
Questionnaire Processing
Questionnaire processing comprises the automatic capture of the contents of filled-out questionnaires. In contrast to forms, questionnaires are most often designed to be filled out completely or to a large extent using check marks (see Multiple-choice Question). This enables recognition at high speed and with a low error rate.
Resolution
The resolution of a scanner is a measure of how densely the image will be sampled. It is measured in dpi (dots per inch). For OCR purposes a resolution between 300 and 400 dpi is sufficient; for scanning images for reproduction purposes, resolutions of 1200 dpi and higher are used.
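The effect of the resolution on the size of the scanned image can be estimated with a short calculation. The Python sketch below assumes a US Letter page (8.5 x 11 inches) purely as an example; the function name is illustrative.

```python
def pixel_size(width_in: float, height_in: float, dpi: int) -> tuple[int, int]:
    """Width and height in image points for a page scanned at the given resolution."""
    return round(width_in * dpi), round(height_in * dpi)


for dpi in (300, 1200):
    width, height = pixel_size(8.5, 11, dpi)
    print(f"{dpi} dpi: {width} x {height} = {width * height} image points")
```

At 300 dpi the page measures 2550 x 3300 image points. Quadrupling the resolution to 1200 dpi multiplies the number of image points by 16, which is why high resolutions are reserved for reproduction work.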
Scanning
Scanning in conjunction with OCR is the process of optical rasterization of a printed document by using a scanner. The document will be divided into image points and each point will be assigned a value describing black and white, grayscale or color information.
Scanner
A scanner is a PC peripheral for digitizing images or documents. Scanners can be categorized into barcode scanners, flatbed scanners and document scanners.
Text Capture, Text Recognition
Text capture and text recognition are synonyms for the automatic conversion of printed characters into editable text files. See also OCR.
TWAIN
TWAIN is a driver standard developed by the companies Aldus, Eastman Kodak, Hewlett-Packard and Logitech for controlling scanners and digital cameras under Microsoft Windows. Because it is so widespread, it is supported by practically all scanners and image processing programs. See also ISIS.