Tesseract is an optical character recognition engine for various operating systems.[5] It is free software, released under the Apache License.[1][6][7] Originally developed by Hewlett-Packard as proprietary software in the 1980s, it was released as open source in 2005 and development was sponsored by Google in 2006.[8]
In 2006, Tesseract was considered one of the most accurate open-source OCR engines available.[7][9]
The Tesseract engine was originally developed as proprietary software at Hewlett-Packard labs in Bristol, England and Greeley, Colorado, United States between 1985 and 1994, with further changes made in 1996 to port it to Windows and a partial migration from C to C++ in 1998. At that point most of the code was written in C, with some in C++; since then, all of the code has been converted to C++.[1] Very little work was done in the following decade. It was then released as open source in 2005 by Hewlett-Packard and the University of Nevada, Las Vegas (UNLV). Tesseract development was sponsored by Google in 2006.[8]
Version 4 adds an LSTM-based OCR engine and models for many additional languages and scripts, bringing the total to 116 languages.[10] Additionally, 37 scripts are supported.
Since 2018, the Mannheim University Library has contributed to the development of Tesseract through several projects. Most of these were funded by the German Research Foundation.[11][12]
Version 5 was released in 2021.[13]
Tesseract was in the top three OCR engines in terms of character accuracy in 1995.[14] It is available for Linux, Windows and Mac OS X.[6][7]
Tesseract, up to and including version 2, could only accept TIFF images of simple one-column text as inputs. These early versions did not include layout analysis, and so inputting multi-columned text, images, or equations produced garbled output. Since version 3, Tesseract has supported output text formatting, hOCR[15] positional information and page-layout analysis. Support for a number of new image formats was added using the Leptonica library. Tesseract can detect whether text is monospaced or proportionally spaced.[7]
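As an illustration of the positional information carried by hOCR, the following minimal sketch extracts word bounding boxes from an hOCR fragment. The sample markup is hand-written in the hOCR style (it is not captured Tesseract output), and the names `WordBoxParser`, `parse_bbox` and `extract_words` are hypothetical helpers introduced only for this example.

```python
from html.parser import HTMLParser

# Hand-written sample in the hOCR style (an assumption, not real Tesseract output).
SAMPLE_HOCR = """
<div class='ocr_page' title='bbox 0 0 640 480'>
  <span class='ocr_line' title='bbox 36 92 210 116'>
    <span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 96'>Hello</span>
    <span class='ocrx_word' title='bbox 110 92 210 116; x_wconf 91'>world</span>
  </span>
</div>
"""


def parse_bbox(title):
    """Extract the 'bbox x0 y0 x1 y1' property from an hOCR title attribute."""
    for prop in title.split(";"):
        parts = prop.split()
        if parts and parts[0] == "bbox":
            return tuple(int(v) for v in parts[1:5])
    return None


class WordBoxParser(HTMLParser):
    """Collect (text, bbox) pairs from elements whose class is 'ocrx_word'."""

    def __init__(self):
        super().__init__()
        self.words = []            # list of (text, (x0, y0, x1, y1))
        self._pending_bbox = None  # bbox of the word element being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            self._pending_bbox = parse_bbox(attrs.get("title") or "")

    def handle_data(self, data):
        # The first non-blank text after a word start tag is the word itself.
        if self._pending_bbox is not None and data.strip():
            self.words.append((data.strip(), self._pending_bbox))
            self._pending_bbox = None


def extract_words(hocr_text):
    parser = WordBoxParser()
    parser.feed(hocr_text)
    return parser.words
```

Calling `extract_words(SAMPLE_HOCR)` yields each word together with its pixel coordinates, which is the kind of data a frontend can use to overlay recognized text on the page image.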
The initial versions of Tesseract could only recognize English-language text.
Tesseract v2 added six additional Western languages (French, Italian, German, Spanish, Brazilian Portuguese, Dutch).
Version 3 extended language support significantly to include ideographic (Chinese & Japanese) and right-to-left (e.g. Arabic, Hebrew) languages, as well as many more scripts. New languages included Arabic, Bulgarian, Catalan, Chinese (Simplified and Traditional), Croatian, Czech, Danish, German (Fraktur script), Greek, Finnish, Hebrew, Hindi, Hungarian, Indonesian, Japanese, Korean, Latvian, Lithuanian, Norwegian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak (standard and Fraktur script), Slovenian, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian and Vietnamese.
Version 3.04, released in July 2015, added an additional 39 language/script combinations, bringing the total count of supported languages to over 100. New language codes included: amh (Amharic), asm (Assamese), aze_cyrl (Azerbaijani in Cyrillic script), bod (Tibetan), bos (Bosnian), ceb (Cebuano), cym (Welsh), dzo (Dzongkha), fas (Persian), gle (Irish), guj (Gujarati), hat (Haitian and Haitian Creole), iku (Inuktitut), jav (Javanese), kat (Georgian), kat_old (Old Georgian), kaz (Kazakh), khm (Central Khmer), kir (Kyrgyz), kur (Kurdish), lao (Lao), lat (Latin), mar (Marathi), mya (Burmese), nep (Nepali), ori (Oriya), pan (Punjabi), pus (Pashto), san (Sanskrit), sin (Sinhala), srp_latn (Serbian in Latin script), syr (Syriac), tgk (Tajik), tir (Tigrinya), uig (Uyghur), urd (Urdu), uzb (Uzbek), uzb_cyrl (Uzbek in Cyrillic script), yid (Yiddish).[16] Tesseract can be trained to work in other languages.[7]
Accuracy rates for other languages were shown in a presentation by Ray Smith at DAS 2016 in Santorini.[17]
Tesseract is suitable for use as a backend and can be used for more complicated OCR tasks including layout analysis by using a frontend such as OCRopus.[18]
Tesseract's output will be of very poor quality if the input images are not preprocessed to suit it. Images (especially screenshots) must be scaled up so that the text x-height is at least 20 pixels;[19] any rotation or skew must be corrected, or no text will be recognized; low-frequency changes in brightness must be high-pass filtered, or Tesseract's binarization stage will destroy much of the page; and dark borders must be manually removed, or they will be misinterpreted as characters.[20]
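Two of these preprocessing steps can be sketched in a few lines. The example below is a simplified illustration, not Tesseract's own code: it computes the enlargement needed to reach the 20-pixel x-height mentioned above, and applies a basic high-pass filter by subtracting a box-blurred copy of the image. The function names, the box-blur approach and the list-of-rows grayscale representation are all assumptions made for the example; in practice an image-processing library would normally be used.

```python
MIN_X_HEIGHT_PX = 20  # minimum x-height quoted in the text above


def upscale_factor(measured_x_height_px, target=MIN_X_HEIGHT_PX):
    """Return the factor by which to enlarge an image so x-height >= target."""
    if measured_x_height_px <= 0:
        raise ValueError("x-height must be positive")
    # Never shrink: images that already meet the target keep a factor of 1.
    return max(1.0, target / measured_x_height_px)


def box_blur(image, radius):
    """Mean of each pixel's square neighbourhood, clipped at the image edges."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            total = sum(image[j][i] for j in ys for i in xs)
            out[y][x] = total / (len(ys) * len(xs))
    return out


def high_pass(image, radius, mid_gray=128):
    """Subtract the local mean (low frequencies) and re-centre on mid_gray."""
    blurred = box_blur(image, radius)
    return [
        [min(255, max(0, round(p - b + mid_gray))) for p, b in zip(row, brow)]
        for row, brow in zip(image, blurred)
    ]
```

For a screenshot whose x-height measures 8 pixels, `upscale_factor(8)` gives 2.5, so the image would be enlarged two and a half times before recognition. The blur radius for `high_pass` should be large relative to the stroke width, so that slow lighting gradients are removed while the characters themselves survive.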

Tesseract is executed from the command-line interface.[21] While Tesseract is not supplied with a GUI, there are many separate projects which provide a GUI for it.[22] One common example is OCRFeeder.[23] A cross-platform open-source GUI is gImageReader.[1]
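A script can drive the command-line program in roughly the following way. This is a minimal sketch that assumes the usual invocation form `tesseract IMAGE OUTPUTBASE -l LANG [hocr]`, with several languages joined by `+`. The helper names `build_tesseract_command` and `run_ocr` are hypothetical, and the output suffixes are an assumption based on recent versions, so they may differ in older releases.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, languages=("eng",), hocr=False):
    """Build an argv list of the form: tesseract IMAGE OUTPUTBASE -l LANGS [hocr]."""
    cmd = ["tesseract", str(image_path), str(output_base), "-l", "+".join(languages)]
    if hocr:
        cmd.append("hocr")  # config name that requests hOCR instead of plain text
    return cmd


def run_ocr(image_path, output_base, **kwargs):
    """Run Tesseract if it is installed and return the expected output path."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    cmd = build_tesseract_command(image_path, output_base, **kwargs)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    # Assumption: recent versions append .txt or .hocr to the output base.
    suffix = ".hocr" if kwargs.get("hocr") else ".txt"
    return f"{output_base}{suffix}"
```

For example, `build_tesseract_command("page.png", "out", languages=("eng", "deu"), hocr=True)` produces the arguments for recognizing mixed English and German text with hOCR output. Keeping command construction separate from execution makes the command easy to inspect or log before anything is run.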
In a July 2007 article on Tesseract, Anthony Kay of Linux Journal termed it "a quirky command-line tool that does an outstanding job". At that time he noted "Tesseract is a bare-bones OCR engine. The build process is a little quirky, and the engine needs some additional features (such as layout detection), but the core feature, text recognition, is drastically better than anything else I've tried from the Open Source community. It is reasonably easy to get excellent recognition rates using nothing more than a scanner and some image tools, such as The GIMP and Netpbm."[5]
In November 2020, Brewster Kahle from the Internet Archive praised Tesseract, saying:
Tesseract has made a major step forward in the last few years. When we last evaluated the accuracy it was not as good as the proprietary OCR, but that has changed – we have done evaluations and it is just as good, and can get better for our application because of its new architecture.[24]