目前,Tesseract可以识别超过100种语言。也可以用来训练其它的语言。
源码包提供了一个OCR的引擎——libtesseract
以及一个命令行程序——tesseract
。Tesseract文字识别主要流程为:二值化,切分处理,识别,纠错等步骤。Tesseract引擎概括地可以分为图片布局分析,字符分割和识别两个部分。而其中的字符分割和识别是整个tesseract的设计目标。对于字符切割tesseract细致地可以分为四个部分:分析连通区域找;到块区域;找文本行和单词;得出(识别)文本。而Tesseract提供的API可以在baseapi.h文件中找到。本文将主要介绍baseapi.h文件中常用的api接口使用方法,运用各接口完成简单的识别调用。在随后的的文章中再一一关注各API的源码,了解具体的算法及实现方法。
(tesseract4.0的文档)
- tesseract::TessBaseAPI,基础的接口函数,包含了初始化,简单的处理图片文字信息,版面分析的结果体等。
- IMAGE,只是一个类,里边封装了相关的图片操作,包括图片的读取,图片参数信息的获取等。
- 其他,包括数据类型声明,相关结构体声明,跨平台处理,命令端参数提取等。
我们在实际中用到的就是前两个里边的东西。
Tesseract的大部分接口说明及其用法
1. void SetImage(const unsigned char* imagedata, int width, int height,
int bytes_per_pixel, int bytes_per_line);
Provide an image for Tesseract to recognize. Format is as
* TesseractRect above. Copies the image buffer and converts to Pix.
* SetImage clears all recognition results, and sets the rectangle to the
* full image, so it may be followed immediately by a GetUTF8Text, and it
* will automatically perform recognition.
为Tesseract 提供待识别的图片。
void SetImage(Pix* pix);
/**
* Provide an image for Tesseract to recognize. As with SetImage above,
* Tesseract takes its own copy of the image, so it need not persist until
* after Recognize.
* Pix vs raw, which to use?
* Use Pix where possible. Tesseract uses Pix as its internal representation
* and it is therefore more efficient to provide a Pix directly.
*/
1. void SetRectangle(int left, int top, int width, int height);
/**
* Restrict recognition to a sub-rectangle of the image. Call after SetImage.
* Each SetRectangle clears the recogntion results so multiple rectangles
* can be recognized with the same image.
*/
识别限制到图像的一个子矩形区域,SetImage之后调用此函数。每一次该函数调用后将清除识别结果,以便同一张图像可以进行多矩形区域的识别。
3. void SetSourceResolution(int ppi);
设置源图像的分辨率(像素每英尺),可以计算最终的字体大小信息。SetImage之后调用此函数。
* Set the resolution of the source image in pixels per inch so font size
* information can be calculated in results. Call this after SetImage().
4. /**
* In extreme cases only, usually with a subclass of Thresholder, it
* is possible to provide a different Thresholder. The Thresholder may
* be preloaded with an image, settings etc, or they may be set after.
* Note that Tesseract takes ownership of the Thresholder and will
* delete it when it it is replaced or the API is destructed.
*/
void SetThresholder(ImageThresholder* thresholder) {
delete thresholder_;
thresholder_ = thresholder;
ClearResults();
}
5.
/**
* Get a copy of the internal thresholded image from Tesseract.
* Caller takes ownership of the Pix and must pixDestroy it.
* May be called any time after SetImage, or after TesseractRect.
*/
Pix* GetThresholdedImage();
6.
/**
* Get the result of page layout analysis as a leptonica-style
* Boxa, Pixa pair, in reading order.
* Can be called before or after Recognize.
*/
Boxa* GetRegions(Pixa** pixa);
以aleptonica-style Boxa, Pixa pair格式获得页面结构分析的结果,在Recognize前后均可被调用。
7.
/**
* Get the textlines as a leptonica-style
* Boxa, Pixa pair, in reading order.
* Can be called before or after Recognize.
* If raw_image is true, then extract from the original image instead of the
* thresholded image and pad by raw_padding pixels.
* If blockids is not nullptr, the block-id of each line is also returned as an
* array of one element per line. delete [] after use.
* If paraids is not nullptr, the paragraph-id of each line within its block is
* also returned as an array of one element per line. delete [] after use.
*/
Boxa* GetTextlines(const bool raw_image, const int raw_padding,
Pixa** pixa, int** blockids, int** paraids);
以aleptonica-style Boxa, Pixa pair格式获取文本行,在Recognize前后均可被调用。如果blockids(block数目)是空的话,每行block-id返回每行一个元素的数组,使用之后被删除。
/*
Helper method to extract from the thresholded image. (most common usage)
*/
Boxa* GetTextlines(Pixa** pixa, int** blockids) {
return GetTextlines(false, 0, pixa, blockids, nullptr);
}
/**
* Get textlines and strips of image regions as a leptonica-style Boxa, Pixa
* pair, in reading order. Enables downstream handling of non-rectangular
* regions.
* Can be called before or after Recognize.
* If blockids is not nullptr, the block-id of each line is also returned as an
* array of one element per line. delete [] after use.
*/
Boxa* GetStrips(Pixa** pixa, int** blockids);
以aleptonica-style Boxa, Pixa pair格式获取图像区域的文本行和条形区域,方便后面非矩形区域的处理。在Recognize前后均可被调用
/**
* Get the words as a leptonica-style
* Boxa, Pixa pair, in reading order.
* Can be called before or after Recognize.
*/
Boxa* GetWords(Pixa** pixa);
以aleptonica-style Boxa, Pixa pair格式获取图像区域的文字,在Recognize前后均可被调用。
8.
* Gets the individual connected (text) components (created
* after pages segmentation step, but before recognition)
* as a leptonica-style Boxa, Pixa pair, in reading order.
* Can be called before or after Recognize.
* Note: the caller is responsible for calling boxaDestroy()
* on the returned Boxa array and pixaDestroy() on cc array.
*/
Boxa* GetConnectedComponents(Pixa** cc);
在页面分析之后识别之间,以aleptonica-style Boxa, Pixa pair格式获得独立连通的文本区域,在Recognize前后均可被调用。
/**
* Get the given level kind of components (block, textline, word etc.) as a
* leptonica-style Boxa, Pixa pair, in reading order.
* Can be called before or after Recognize.
* If blockids is not nullptr, the block-id of each component is also returned
* as an array of one element per component. delete [] after use.
* If blockids is not nullptr, the paragraph-id of each component with its block
* is also returned as an array of one element per component. delete [] after
* use.
* If raw_image is true, then portions of the original image are extracted
* instead of the thresholded image and padded with raw_padding.
* If text_only is true, then only text components are returned.
*/
Boxa* GetComponentImages(const PageIteratorLevel level,
const bool text_only, const bool raw_image,
const int raw_padding,
Pixa** pixa, int** blockids, int** paraids);
以aleptonica-style Boxa, Pixa pair格式获得制定级别的元素(block,textline, word),在Recognize前后均可被调用。果blockids(block数目)是空的话,每行block-id返回每行一个元素的数组,使用之后被删除。如果text_only 为真,只有text可被返回。
// Helper function to get binary images with no padding (most common usage).
Boxa* GetComponentImages(const PageIteratorLevel level,
const bool text_only,
Pixa** pixa, int** blockids) {
return GetComponentImages(level, text_only, false, 0, pixa, blockids, nullptr);
}
9:DumpPGM 函数声明:
void tesseract::TessBaseAPI::DumpPGM ( const char * filename ) 将内部二值图像放到PGM文件中。
10:AnalyseLayout 函数声明:
/**
* Runs page layout analysis in the mode set by SetPageSegMode.
* May optionally be called prior to Recognize to get access to just
* the page layout results. Returns an iterator to the results.
* If merge_similar_words is true, words are combined where suitable for use
* with a line recognizer. Use if you want to use AnalyseLayout to find the
* textlines, and then want to process textline fragments with an external
* line recognizer.
* Returns nullptr on error or an empty page.
* The returned iterator must be deleted after use.
* WARNING! This class points to data held within the TessBaseAPI class, and
* therefore can only be used while the TessBaseAPI class still exists and
* has not been subjected to a call of Init, SetImage, Recognize, Clear, End
* DetectOS, or anything else that changes the internal PAGE_RES.
*/
PageIterator* AnalyseLayout();
PageIterator* AnalyseLayout(bool merge_similar_words);
以SetPageSegMode设定的模式进行页面结构分析,返回一个(iterator),错误返回为空。Iterator 使用后必须删除。注意:该函数指向TessBaseAPI 类内部的数据,因此必须在TessBaseAPI 存在的情况下才可被调用。不能被改变内部PAGE_RES的Init, SetImage, Recognize, Clear, End DetectOS或者其他调用。
11:Recognize 函数声明:
/**
* Recognize the image from SetAndThresholdImage, generating Tesseract
* internal structures. Returns 0 on success.
* Optional. The Get*Text functions below will call Recognize if needed.
* After Recognize, the output is kept internally until the next SetImage.
*/
int Recognize(ETEXT_DESC* monitor);
int tesseract::TessBaseAPI::Recognize(ETEXT_DESC * monitor)
识别 来自SetAndThresholdImage的图像,产生Tesseract 内部结构数据,成功返回0,如果需要,下面的Get*Tex函数会调用它。识别完成后,在SetImage之前,输出都会保持在内部。
12:RecognizeForChopTest 函数声明:
/**
* Methods to retrieve information after SetAndThresholdImage(),
* Recognize() or TesseractRect(). (Recognize is called implicitly if needed.)
*/
/** Variant on Recognize used for testing chopper. */
int RecognizeForChopTest(ETEXT_DESC* monitor);
int tesseract::TessBaseAPI::RecognizeForChopTest(ETEXT_DESC * monitor)
检索来自SetAndThresholdImage(), Recognize() or TesseractRect()的信息(在需要的情况下隐式调用Recognize)。对Recognize 变化一测试chopper. 13:ProcessPages 函数声明:
/**
* Turns images into symbolic text.
*
* filename can point to a single image, a multi-page TIFF,
* or a plain text list of image filenames.
*
* retry_config is useful for debugging. If not nullptr, you can fall
* back to an alternate configuration if a page fails for some
* reason.
*
* timeout_millisec terminates processing if any single page
* takes too long. Set to 0 for unlimited time.
*
* renderer is responible for creating the output. For example,
* use the TessTextRenderer if you want plaintext output, or
* the TessPDFRender to produce searchable PDF.
*
* If tessedit_page_number is non-negative, will only process that
* single page. Works for multi-page tiff file, or filelist.
*
* Returns true if successful, false on error.
*/
bool ProcessPages(const char* filename, const char* retry_config,
int timeout_millisec, TessResultRenderer* renderer);
// Does the real work of ProcessPages.
bool ProcessPagesInternal(const char* filename, const char* retry_config,
int timeout_millisec, TessResultRenderer* renderer);
bool tesseract::TessBaseAPI::ProcessPages ( const char * filename,
const char * retry_config,
int STRING * )
timeout_millisec, text_out
识别指定文件的所有页面,文件格式为(a multi-page tiff or list of filenames, or single image), 并且根据参数(tessedit_create_boxfile, tessedit_make_boxes_from_boxes, tessedit_write_unlv, tessedit_create_hocr.)得到合适的文本。在输入文件的每一页运行ProcessPage,输入文件可以是(a multi-page tiff, single-page other file format, or a plain text list of images to read),返回值放在text_out中。如果tessedit_page_number 非负,程序将会在其所代表那一页开始。运行错误返回false. 如果程序暂停在某一页timeout_millisec(非负)时间终止程序,或者由于某些原因一些页面处理失败,该页面将会以retry_config的配置文件重新处理。
14:ProcessPage 函数声明:
/**
* Turn a single image into symbolic text.
*
* The pix is the image processed. filename and page_index are
* metadata used by side-effect processes, such as reading a box
* file or formatting as hOCR.
*
* See ProcessPages for desciptions of other parameters.
*/
bool ProcessPage(Pix* pix, int page_index, const char* filename,
const char* retry_config, int timeout_millisec,
TessResultRenderer* renderer);
bool tesseract::TessBaseAPI::ProcessPage ( Pix *
int
pix,
page_index,
const char * filename, const char * retry_config, int STRING * )
timeout_millisec, text_out
为ProcessPages进行单页面识别。Text放到text_out中, pix是文件名,page_index是边缘处理后的元数据,比如box文件,或者hOCR格式文件。
15:GetIterator 函数声明:
ResultIterator * tesseract::TessBaseAPI::GetIterator()
为 LayoutAnalysis and/or Recognize运行结果获取读取顺序的迭代器(iterator),使用之后删除。
16:GetMutableIterator 函数声明:
/**
* Get a reading-order iterator to the results of LayoutAnalysis and/or
* Recognize. The returned iterator must be deleted after use.
* WARNING! This class points to data held within the TessBaseAPI class, and
* therefore can only be used while the TessBaseAPI class still exists and
* has not been subjected to a call of Init, SetImage, Recognize, Clear, End
* DetectOS, or anything else that changes the internal PAGE_RES.
*/
ResultIterator* GetIterator();
MutableIterator * tesseract::TessBaseAPI::GetMutableIterator()
为 LayoutAnalysis and/or Recognize运行结果获取可变的迭代器(iterator),使用之后删除。
17:GetUTF8Text 函数声明:
/**
* The recognized text is returned as a char* which is coded
* as UTF8 and must be freed with the delete [] operator.
*/
char* GetUTF8Text();
char * tesseract::TessBaseAPI::GetUTF8Text()
识别的文本被返回为字符指针,以UTF8编码(must be freed with the delete [] operator)。从内部数据结构中获得文本字符串。
18:
函数声明:
/**
* Make a HTML-formatted string with hOCR markup from the internal
* data structures.
* page_number is 0-based but will appear in the output as 1-based.
* monitor can be used to
* cancel the recognition
* receive progress callbacks
* Returned string must be freed with the delete [] operator.
*/
char* GetHOCRText(ETEXT_DESC* monitor, int page_number);
char * tesseract::TessBaseAPI::GetHOCRText(int page_number)