===Text recognition===
There are two basic types of core OCR algorithm, which may produce a ranked list of candidate characters.<ref>{{cite web|url=http://www.dataid.com/aboutocr.htm |title=OCR Introduction |publisher=Dataid.com |access-date=2013-06-16}}</ref>

* ''Matrix matching'' involves comparing an image to a stored glyph on a pixel-by-pixel basis; it is also known as ''pattern matching'', ''[[pattern recognition]]'', or ''[[digital image correlation|image correlation]]''. It relies on the input glyph being correctly isolated from the rest of the image and on the stored glyph being in a similar font and at the same scale. The technique works best with typewritten text and does not work well when new fonts are encountered. It is the technique that early physical photocell-based OCR implemented, rather directly.
* ''Feature extraction'' decomposes glyphs into "features" such as lines, closed loops, line direction, and line intersections. Extracting features reduces the dimensionality of the representation and makes the recognition process computationally efficient. The features are compared with an abstract vector-like representation of a character, which might reduce to one or more glyph prototypes. General techniques of [[Feature detection (computer vision)|feature detection in computer vision]] are applicable to this type of OCR, which is commonly seen in "intelligent" [[handwriting recognition]] and most modern OCR software.<ref name="ocrwizard">{{cite web|title=How OCR Software Works|url=http://ocrwizard.com/ocr-software/how-ocr-software-works.html|url-status=dead|archive-url=https://web.archive.org/web/20090816210246/http://ocrwizard.com/ocr-software/how-ocr-software-works.html|archive-date=August 16, 2009|access-date=2013-06-16|publisher=OCRWizard}}</ref> [[Nearest neighbour classifiers]] such as the [[k-nearest neighbors algorithm]] are used to compare image features with stored glyph features and choose the nearest match.<ref>{{cite web|url=http://blog.damiles.com/2008/11/14/the-basic-patter-recognition-and-classification-with-opencv.html |title=The basic pattern recognition and classification with openCV | Damiles |publisher=Blog.damiles.com |access-date=2013-06-16|date=2008-11-14 }}</ref> A small illustrative sketch of both approaches follows this list.
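The difference between the two approaches can be illustrated with a short, self-contained sketch. The example below is not drawn from any particular OCR package: the 5×5 binary glyph templates and the crude ink-count feature vector are invented stand-ins for real stored glyphs and for real features such as loops and intersections. It ranks candidates by pixel overlap (matrix matching) and picks the closest template in feature space (nearest-neighbour classification).

<syntaxhighlight lang="python">
import numpy as np

# Toy 5x5 binary glyph templates (invented for illustration, not a real font).
TEMPLATES = {
    "I": np.array([[0, 0, 1, 0, 0]] * 5),
    "L": np.array([[1, 0, 0, 0, 0],
                   [1, 0, 0, 0, 0],
                   [1, 0, 0, 0, 0],
                   [1, 0, 0, 0, 0],
                   [1, 1, 1, 1, 1]]),
    "T": np.array([[1, 1, 1, 1, 1],
                   [0, 0, 1, 0, 0],
                   [0, 0, 1, 0, 0],
                   [0, 0, 1, 0, 0],
                   [0, 0, 1, 0, 0]]),
}

def matrix_match(glyph):
    """Matrix matching: score each stored template by the number of
    agreeing pixels and return the candidates ranked best-first."""
    scores = {ch: int(np.sum(glyph == tpl)) for ch, tpl in TEMPLATES.items()}
    return sorted(scores, key=scores.get, reverse=True)

def features(glyph):
    """Feature extraction: a crude feature vector (ink in each half of the
    glyph plus total ink) standing in for features such as loops,
    line directions, and intersections."""
    return np.array([glyph[:3].sum(), glyph[2:].sum(),
                     glyph[:, :3].sum(), glyph[:, 2:].sum(),
                     glyph.sum()], dtype=float)

def nearest_neighbour(glyph):
    """1-nearest-neighbour classification in feature space."""
    query = features(glyph)
    dists = {ch: np.linalg.norm(query - features(tpl))
             for ch, tpl in TEMPLATES.items()}
    return min(dists, key=dists.get)

if __name__ == "__main__":
    noisy_t = TEMPLATES["T"].copy()
    noisy_t[4, 0] = 1                       # a "T" with one stray pixel
    print("matrix matching ranking:", matrix_match(noisy_t))   # "T" first
    print("nearest neighbour:", nearest_neighbour(noisy_t))
</syntaxhighlight>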
Software such as [[CuneiForm (software)|Cuneiform]] and [[Tesseract (software)|Tesseract]] use a two-pass approach to character recognition. The second pass, known as adaptive recognition, uses the letter shapes recognized with high confidence on the first pass to better recognize the remaining letters on the second pass. This is advantageous for unusual fonts or for low-quality scans where the font is distorted (e.g. blurred or faded).<ref name="Tesseract overview">{{cite web|author=Smith, Ray |year=2007|title=An Overview of the Tesseract OCR Engine|url=http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseracticdar2007.pdf|url-status=dead|archive-url=https://web.archive.org/web/20100928052954/http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseracticdar2007.pdf|archive-date=September 28, 2010|access-date=2013-05-23}}</ref> A minimal sketch of this adaptive second pass is given at the end of this section.

{{As of|2016|12}}, modern OCR software includes [[Google Docs]] OCR, [[ABBYY FineReader]], and Transym.<ref>{{Cite journal|last=Assefi|first=Mehdi|date=December 2016|title=OCR as a Service: An Experimental Evaluation of Google Docs OCR, Tesseract, ABBYY FineReader, and Transym|url=https://www.researchgate.net/publication/310645810|journal=ResearchGate}}</ref>{{update inline|date=June 2023}} Others, such as [[OCRopus]] and Tesseract, use [[Artificial neural network|neural networks]] trained to recognize whole lines of text instead of focusing on single characters.

A technique known as iterative OCR automatically crops a document into sections based on the page layout. OCR is then performed on each section individually, using variable character confidence level thresholds to maximize page-level OCR accuracy. A patent from the United States Patent Office has been issued for this method.<ref>{{Cite web|title=How the Best OCR Technology Captures 99.91% of Data|url=https://www.bisok.com/grooper-data-capture-method-features/multi-pass-ocr/|access-date=2021-05-27|website=www.bisok.com}}</ref>

The OCR result can be stored in the standardized [[ALTO (XML)|ALTO]] format, a dedicated [[XML schema]] maintained by the United States [[Library of Congress]]. Other common formats include [[hOCR]] and [[Page Analysis and Ground Truth Elements|PAGE XML]].

For a list of optical character recognition software, see [[Comparison of optical character recognition software]].
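The adaptive two-pass approach described above can be sketched in a few lines, under simplifying assumptions: glyph images are already segmented and normalized to a common size, the classifier is a plain pixel-overlap score rather than Cuneiform's or Tesseract's actual engine, and the 0.9 confidence threshold is an arbitrary illustrative value.

<syntaxhighlight lang="python">
from dataclasses import dataclass

import numpy as np


@dataclass
class Result:
    char: str      # best-matching character
    score: float   # fraction of agreeing pixels; 1.0 means a perfect match


def classify(glyph, templates):
    """Score a glyph against every template and keep the best match."""
    best = Result(char="?", score=-1.0)
    for ch, tpl in templates.items():
        score = float(np.mean(glyph == tpl))
        if score > best.score:
            best = Result(ch, score)
    return best


def two_pass_ocr(glyphs, templates, confidence=0.9):
    """Adaptive recognition: glyphs recognized with high confidence on the
    first pass become document-specific templates, and the remaining
    (uncertain) glyphs are re-scored against the enlarged template set."""
    # Pass 1: classify everything against the generic templates.
    first = [classify(g, templates) for g in glyphs]

    # Adapt: remember how this particular document renders confident letters.
    adapted = dict(templates)
    for glyph, result in zip(glyphs, first):
        if result.score >= confidence:
            adapted[result.char] = glyph

    # Pass 2: re-classify only the glyphs that were uncertain the first time.
    final = [r if r.score >= confidence else classify(g, adapted)
             for g, r in zip(glyphs, first)]
    return [r.char for r in final]
</syntaxhighlight>

In the real engines the adapted templates and the classifier are far more elaborate, but the control flow (classify, keep the confident results, re-score the rest against the document-specific shapes) is the essence of the approach.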