unicharambigs (5) - Linux Manuals

unicharambigs: Tesseract unicharset ambiguities




The unicharambigs file (a component of traineddata, see combine_tessdata(1)) is used by Tesseract to represent possible ambiguities between characters, or groups of characters.

The file contains a number of lines, laid out as follows:

[num] <TAB> [char(s)] <TAB> [num] <TAB> [char(s)] <TAB> [num]

Field one

the number of characters contained in field two

Field two

the character sequence to be replaced

Field three

the number of characters contained in field four

Field four

the character sequence used to replace field two

Field five

contains either 1 or 0. 1 denotes a mandatory replacement, 0 denotes an optional replacement.

Characters appearing in fields two and four should appear in unicharset. The numbers in fields one and three refer to the number of unichars (not bytes).
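The five-field layout above can be checked mechanically. The following is a minimal Python sketch, not part of Tesseract: the names Ambig and parse_ambig_line are hypothetical, and it assumes that the unichars inside fields two and four are separated by single spaces, as they appear in the example in this page.

```python
from typing import NamedTuple, Tuple


class Ambig(NamedTuple):
    """One unicharambigs line (hypothetical container, not a Tesseract type)."""
    source: Tuple[str, ...]   # field two: unichars to be replaced
    target: Tuple[str, ...]   # field four: unichars used as the replacement
    mandatory: bool           # field five: True for 1, False for 0


def parse_ambig_line(line: str) -> Ambig:
    """Parse one TAB-separated line and validate the counts in fields one and three."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}")
    n_source, source_text, n_target, target_text, flag = fields

    # Assumption: unichars within a field are separated by spaces.
    source = tuple(source_text.split())
    target = tuple(target_text.split())

    # Fields one and three count unichars, not bytes.
    if len(source) != int(n_source):
        raise ValueError(f"field one says {n_source}, field two has {len(source)}")
    if len(target) != int(n_target):
        raise ValueError(f"field three says {n_target}, field four has {len(target)}")
    if flag not in ("0", "1"):
        raise ValueError(f"field five must be 0 or 1, got {flag!r}")

    return Ambig(source, target, flag == "1")
```

For instance, parse_ambig_line("1\tm\t2\tr n\t0") yields an optional rule mapping the single unichar m to the two unichars r and n. The sketch does not check that each unichar is present in unicharset; that lookup would need the unicharset file itself.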


2       ' '     1       "     1
1       m       2       r n   0
3       i i i   1       m     0

In this example, all instances of the 2 character sequence '' will always be replaced by the 1 character sequence "; a 1 character sequence m may be replaced by the 2 character sequence rn, and the 3 character sequence iii may be replaced by the 1 character sequence m.
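The difference between mandatory and optional replacements can be illustrated with a short Python sketch. This is only an illustration of the semantics described in this page, not Tesseract's actual algorithm: the rule table RULES and the functions apply_mandatory and optional_alternatives are hypothetical names, and the left-to-right, first-match strategy is an assumption made for simplicity.

```python
# Each rule is (source unichars, target unichars, mandatory flag),
# mirroring the three example lines of this page.
RULES = [
    (("'", "'"), ('"',), True),
    (("m",), ("r", "n"), False),
    (("i", "i", "i"), ("m",), False),
]


def apply_mandatory(unichars, rules):
    """Rewrite the sequence, always applying rules whose flag is 1."""
    unichars = list(unichars)
    out = []
    i = 0
    while i < len(unichars):
        for source, target, mandatory in rules:
            if mandatory and tuple(unichars[i:i + len(source)]) == source:
                out.extend(target)
                i += len(source)
                break
        else:
            # No mandatory rule matched at this position: keep the unichar.
            out.append(unichars[i])
            i += 1
    return out


def optional_alternatives(unichars, rules):
    """List candidate sequences, each with one optional (flag 0) rule applied."""
    unichars = list(unichars)
    alternatives = []
    for i in range(len(unichars)):
        for source, target, mandatory in rules:
            if not mandatory and tuple(unichars[i:i + len(source)]) == source:
                alternatives.append(
                    unichars[:i] + list(target) + unichars[i + len(source):]
                )
    return alternatives
```

With these rules, apply_mandatory("''arm", RULES) rewrites the two apostrophes to a double quote and leaves a, r, m unchanged, while optional_alternatives("arm", RULES) proposes the single alternative a, r, r, n without altering the original sequence.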


The unicharambigs file first appeared in Tesseract 3.00. Before that, a similar format called DangAmbigs (dangerous ambiguities) was used. It was almost identical, except that only mandatory replacements could be specified and field five was absent.


This is a documentation "bug": it's not currently clear what should be done in the case of ligatures (such as fi) which may also appear as regular letters in the unicharset.


The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985-1995) and Google (2006-present).


tesseract(1), unicharset(5)