Tesseract can currently recognize more than 100 languages, and it can also be trained to recognize others.

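The source package provides an OCR engine, libtesseract, and a command-line program, tesseract. The main recognition pipeline is binarization, segmentation, recognition, and error correction. Broadly, the Tesseract engine has two parts: page layout analysis, and character segmentation and recognition. Character segmentation and recognition is the design goal of Tesseract as a whole. In finer detail, character segmentation has four stages: analyzing connected components; finding block regions; finding text lines and words; and producing (recognizing) the text. The API that Tesseract exposes is declared in baseapi.h. This article covers how to use the commonly used interfaces in baseapi.h and how to combine them into simple recognition calls. Later articles will look at the source of each API in turn to understand the algorithms and how they are implemented.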

(From the Tesseract 4.0 documentation)

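  1. tesseract::TessBaseAPI: the basic interface class, covering initialization, simple processing of the text in an image, the structures that hold layout analysis results, and so on.
  2. IMAGE: simply a class that wraps image operations, including reading an image and querying its parameters.
  3. Others: data type declarations, related struct declarations, cross-platform handling, command-line argument extraction, and so on.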

In practice, only the first two are used.

 

Most of the Tesseract interfaces and how to use them

1. void SetImage(const unsigned char* imagedata, int width, int height,
                int bytes_per_pixel, int bytes_per_line);
 
  Provide an image for Tesseract to recognize. Format is as
   * TesseractRect above. Copies the image buffer and converts to Pix.
   * SetImage clears all recognition results, and sets the rectangle to the
   * full image, so it may be followed immediately by a GetUTF8Text, and it
   * will automatically perform recognition.

   Provides Tesseract with the image to recognize.
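The raw-buffer form of SetImage needs both bytes_per_pixel and bytes_per_line, and the second is easy to get wrong when rows are padded. The following is a minimal sketch of computing those values; `RawLayout`, `MakeLayout` and `BufferSize` are hypothetical helpers written for this illustration and are not part of Tesseract.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of Tesseract): describes a raw image buffer
// in the terms that the raw-buffer SetImage overload takes.
struct RawLayout {
  int width;
  int height;
  int bytes_per_pixel;  // assumed here: 1 for 8-bit grey, 3 for 24-bit colour
  int bytes_per_line;   // row stride in bytes, including any padding
};

// Computes the row stride when each row is padded to a multiple of
// `alignment` bytes. alignment == 1 means rows are tightly packed.
inline RawLayout MakeLayout(int width, int height, int bytes_per_pixel,
                            int alignment) {
  assert(width > 0 && height > 0 && bytes_per_pixel > 0 && alignment > 0);
  const int packed = width * bytes_per_pixel;
  const int stride = (packed + alignment - 1) / alignment * alignment;
  return RawLayout{width, height, bytes_per_pixel, stride};
}

// Total number of bytes the buffer must hold for this layout.
inline std::size_t BufferSize(const RawLayout& l) {
  return static_cast<std::size_t>(l.bytes_per_line) *
         static_cast<std::size_t>(l.height);
}

// With a real TessBaseAPI instance `api` and a buffer `data`, the values
// would be passed on as:
//   api.SetImage(data, l.width, l.height, l.bytes_per_pixel, l.bytes_per_line);
```

For example, a 5-pixel-wide RGB image whose rows are padded to 4 bytes has a stride of 16 bytes rather than 15, and passing 15 would skew every row after the first.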

 

void SetImage(Pix* pix);
 
/**
   * Provide an image for Tesseract to recognize. As with SetImage above,
   * Tesseract takes its own copy of the image, so it need not persist until
   * after Recognize.
   * Pix vs raw, which to use?
   * Use Pix where possible. Tesseract uses Pix as its internal representation
   * and it is therefore more efficient to provide a Pix directly.
   */
  
2. void SetRectangle(int left, int top, int width, int height);
/**
   * Restrict recognition to a sub-rectangle of the image. Call after SetImage.
   * Each SetRectangle clears the recognition results so multiple rectangles
   * can be recognized with the same image.
   */

Restricts recognition to a sub-rectangle of the image; call it after SetImage. Each call clears the recognition results, so several rectangles of the same image can be recognized in turn.
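The source does not say what SetRectangle does with a rectangle that extends past the image, so a cautious caller may want to clip the region first. This is a minimal sketch of such clipping; `Rect` and `ClipToImage` are hypothetical names introduced here, not Tesseract types.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical rectangle type using the same (left, top, width, height)
// convention as the SetRectangle parameters.
struct Rect {
  int left;
  int top;
  int width;
  int height;
};

// Clips `r` to an image of image_w x image_h pixels. The result has zero
// width or height when the rectangle does not overlap the image at all.
inline Rect ClipToImage(Rect r, int image_w, int image_h) {
  assert(image_w >= 0 && image_h >= 0);
  const int x0 = std::max(0, std::min(r.left, image_w));
  const int y0 = std::max(0, std::min(r.top, image_h));
  const int x1 = std::max(x0, std::min(r.left + r.width, image_w));
  const int y1 = std::max(y0, std::min(r.top + r.height, image_h));
  return Rect{x0, y0, x1 - x0, y1 - y0};
}

// With a real TessBaseAPI instance `api`, a non-empty result would be used as:
//   api.SetRectangle(c.left, c.top, c.width, c.height);
```

Because each SetRectangle call clears the previous results, the text for one region has to be read before moving on to the next one.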

 

3.  void SetSourceResolution(int ppi);

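Sets the resolution of the source image in pixels per inch, so that font size information can be calculated in the results. Call it after SetImage.

/**
   * Set the resolution of the source image in pixels per inch so font size
   * information can be calculated in results.  Call this after SetImage().
   */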
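To see why the resolution matters for font size, recall that a typographic point is 1/72 of an inch, so the same pixel height corresponds to different point sizes at different resolutions. The sketch below shows only that general conversion; it is not claimed to be the formula Tesseract uses internally, and `PixelsToPoints` is a hypothetical helper.

```cpp
#include <cassert>

// Converts a height in pixels to typographic points (1 pt = 1/72 inch)
// for an image scanned at `ppi` pixels per inch.
inline double PixelsToPoints(int pixels, int ppi) {
  assert(ppi > 0);
  return pixels * 72.0 / ppi;
}
```

Text that is 50 pixels tall is 12 pt at 300 ppi but 50 pt at 72 ppi, which is why a wrong or missing resolution can distort any reported font sizes.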
4. /**
   * In extreme cases only, usually with a subclass of Thresholder, it
   * is possible to provide a different Thresholder. The Thresholder may
   * be preloaded with an image, settings etc, or they may be set after.
   * Note that Tesseract takes ownership of the Thresholder and will
   * delete it when it is replaced or the API is destructed.
   */
  void SetThresholder(ImageThresholder* thresholder) {
    delete thresholder_;
    thresholder_ = thresholder;
    ClearResults();
  }
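The ownership rule in the comment above ("Tesseract takes ownership of the Thresholder") can be demonstrated with a small stand-alone model. `Thresholder` and `Api` below are hypothetical stand-ins that copy only the delete-then-assign pattern of SetThresholder; they are not the real Tesseract classes.

```cpp
#include <cassert>

// Hypothetical stand-in for ImageThresholder: counts destructions.
struct Thresholder {
  explicit Thresholder(int* destroyed) : destroyed_(destroyed) {}
  ~Thresholder() { ++*destroyed_; }
  int* destroyed_;
};

// Hypothetical stand-in for TessBaseAPI, modelling only thresholder ownership.
class Api {
 public:
  Api() : thresholder_(nullptr) {}
  ~Api() { delete thresholder_; }
  Api(const Api&) = delete;
  Api& operator=(const Api&) = delete;

  // Takes ownership: the previously installed thresholder is deleted here.
  void SetThresholder(Thresholder* t) {
    delete thresholder_;
    thresholder_ = t;
  }

 private:
  Thresholder* thresholder_;
};

// Installs two thresholders in a row, destroys the API, and returns how many
// thresholders were destroyed in total.
inline int DestroyedAfterTwoSets() {
  int destroyed = 0;
  {
    Api api;
    api.SetThresholder(new Thresholder(&destroyed));
    api.SetThresholder(new Thresholder(&destroyed));  // first one deleted here
    assert(destroyed == 1);
  }  // the Api destructor deletes the second one
  return destroyed;
}
```

Two consequences follow from this pattern: the caller must not delete a thresholder after handing it over, and passing the currently installed pointer a second time would leave the API holding a deleted object.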

 

5.

/**
   * Get a copy of the internal thresholded image from Tesseract.
   * Caller takes ownership of the Pix and must pixDestroy it.
   * May be called any time after SetImage, or after TesseractRect.
   */
  Pix* GetThresholdedImage();

 

6.

/**
   * Get the result of page layout analysis as a leptonica-style
   * Boxa, Pixa pair, in reading order.
   * Can be called before or after Recognize.
   */
  Boxa* GetRegions(Pixa** pixa);

Gets the result of page layout analysis as a leptonica-style Boxa, Pixa pair. Can be called before or after Recognize.

 

7.

/**
   * Get the textlines as a leptonica-style
   * Boxa, Pixa pair, in reading order.
   * Can be called before or after Recognize.
   * If raw_image is true, then extract from the original image instead of the
   * thresholded image and pad by raw_padding pixels.
   * If blockids is not nullptr, the block-id of each line is also returned as an
   * array of one element per line. delete [] after use.
   * If paraids is not nullptr, the paragraph-id of each line within its block is
   * also returned as an array of one element per line. delete [] after use.
   */
  Boxa* GetTextlines(const bool raw_image, const int raw_padding,
                     Pixa** pixa, int** blockids, int** paraids);

 

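Gets the text lines as a leptonica-style Boxa, Pixa pair. Can be called before or after Recognize. If blockids is not nullptr, the block-id of each line is also returned as an array with one element per line, which the caller must delete [] after use.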
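Since blockids holds one block id per text line, a common follow-up step is to group the line indices by block. Here is a minimal sketch of that grouping on a plain int array; `GroupLinesByBlock` is a hypothetical helper, and the number of lines is assumed to come from the returned Boxa (for instance via leptonica's boxaGetCount).

```cpp
#include <cassert>
#include <map>
#include <vector>

// Groups line indices by block id. `blockids` has one entry per text line,
// in reading order. Returns an empty map when no ids were requested.
inline std::map<int, std::vector<int>> GroupLinesByBlock(const int* blockids,
                                                         int num_lines) {
  assert(num_lines >= 0);
  std::map<int, std::vector<int>> groups;
  if (blockids == nullptr) return groups;
  for (int i = 0; i < num_lines; ++i) {
    groups[blockids[i]].push_back(i);
  }
  return groups;
}

// In real use the array comes from GetTextlines and must be released by the
// caller afterwards, along with the leptonica structures:
//   delete [] blockids;
//   boxaDestroy(&boxa);
//   pixaDestroy(&pixa);
```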

/*
     Helper method to extract from the thresholded image. (most common usage)
  */
  Boxa* GetTextlines(Pixa** pixa, int** blockids) {
    return GetTextlines(false, 0, pixa, blockids, nullptr);
  }
 
  /**
   * Get textlines and strips of image regions as a leptonica-style Boxa, Pixa
   * pair, in reading order. Enables downstream handling of non-rectangular
   * regions.
   * Can be called before or after Recognize.
   * If blockids is not nullptr, the block-id of each line is also returned as an
   * array of one element per line. delete [] after use.
   */
  Boxa* GetStrips(Pixa** pixa, int** blockids);

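Gets the text lines and strips of image regions as a leptonica-style Boxa, Pixa pair, which makes it possible to handle non-rectangular regions downstream. Can be called before or after Recognize.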

/**
   * Get the words as a leptonica-style
   * Boxa, Pixa pair, in reading order.
   * Can be called before or after Recognize.
   */
  Boxa* GetWords(Pixa** pixa);

Gets the words as a leptonica-style Boxa, Pixa pair. Can be called before or after Recognize.

 

8.  

/**
   * Gets the individual connected (text) components (created
   * after pages segmentation step, but before recognition)
   * as a leptonica-style Boxa, Pixa pair, in reading order.
   * Can be called before or after Recognize.
   * Note: the caller is responsible for calling boxaDestroy()
   * on the returned Boxa array and pixaDestroy() on cc array.
   */
  Boxa* GetConnectedComponents(Pixa** cc);

 

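Gets the individual connected (text) components, which are created after page segmentation but before recognition, as a leptonica-style Boxa, Pixa pair. Can be called before or after Recognize.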

/**
   * Get the given level kind of components (block, textline, word etc.) as a
   * leptonica-style Boxa, Pixa pair, in reading order.
   * Can be called before or after Recognize.
   * If blockids is not nullptr, the block-id of each component is also returned
   * as an array of one element per component. delete [] after use.
   * If paraids is not nullptr, the paragraph-id of each component with its block
   * is also returned as an array of one element per component. delete [] after
   * use.
   * If raw_image is true, then portions of the original image are extracted
   * instead of the thresholded image and padded with raw_padding.
   * If text_only is true, then only text components are returned.
   */
  Boxa* GetComponentImages(const PageIteratorLevel level,
                           const bool text_only, const bool raw_image,
                           const int raw_padding,
                           Pixa** pixa, int** blockids, int** paraids);

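Gets the components of the given level (block, textline, word, etc.) as a leptonica-style Boxa, Pixa pair. Can be called before or after Recognize. If blockids is not nullptr, the block-id of each component is also returned as an array with one element per component, which the caller must delete [] after use. If text_only is true, only text components are returned.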

// Helper function to get binary images with no padding (most common usage).
  Boxa* GetComponentImages(const PageIteratorLevel level,
                           const bool text_only,
                           Pixa** pixa, int** blockids) {
    return GetComponentImages(level, text_only, false, 0, pixa, blockids, nullptr);
  }

 

9: DumpPGM function declaration:

void tesseract::TessBaseAPI::DumpPGM(const char* filename)

Dumps the internal binary image to a PGM file.

10: AnalyseLayout function declaration:

/**
   * Runs page layout analysis in the mode set by SetPageSegMode.
   * May optionally be called prior to Recognize to get access to just
   * the page layout results. Returns an iterator to the results.
   * If merge_similar_words is true, words are combined where suitable for use
   * with a line recognizer. Use if you want to use AnalyseLayout to find the
   * textlines, and then want to process textline fragments with an external
   * line recognizer.
   * Returns nullptr on error or an empty page.
   * The returned iterator must be deleted after use.
   * WARNING! This class points to data held within the TessBaseAPI class, and
   * therefore can only be used while the TessBaseAPI class still exists and
   * has not been subjected to a call of Init, SetImage, Recognize, Clear, End,
   * DetectOS, or anything else that changes the internal PAGE_RES.
   */
  PageIterator* AnalyseLayout();
  PageIterator* AnalyseLayout(bool merge_similar_words);

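Runs page layout analysis in the mode set by SetPageSegMode and returns an iterator over the results, or nullptr on error or an empty page. The iterator must be deleted after use. Note: the iterator points to data held inside the TessBaseAPI object, so it can be used only while that object still exists and has not been subjected to a call to Init, SetImage, Recognize, Clear, End, DetectOS, or anything else that changes the internal PAGE_RES.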

11: Recognize function declaration:

/**
   * Recognize the image from SetAndThresholdImage, generating Tesseract
   * internal structures. Returns 0 on success.
   * Optional. The Get*Text functions below will call Recognize if needed.
   * After Recognize, the output is kept internally until the next SetImage.
   */
  int Recognize(ETEXT_DESC* monitor);
int tesseract::TessBaseAPI::Recognize(ETEXT_DESC * monitor)

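Recognizes the image from SetAndThresholdImage and generates Tesseract's internal structures. Returns 0 on success. The call is optional: the Get*Text functions below call Recognize if needed. After Recognize, the output is kept internally until the next SetImage.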

12: RecognizeForChopTest function declaration:

/**
   * Methods to retrieve information after SetAndThresholdImage(),
   * Recognize() or TesseractRect(). (Recognize is called implicitly if needed.)
   */
 
  /** Variant on Recognize used for testing chopper. */
  int RecognizeForChopTest(ETEXT_DESC* monitor);
int tesseract::TessBaseAPI::RecognizeForChopTest(ETEXT_DESC * monitor)

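Methods for retrieving information after SetAndThresholdImage(), Recognize() or TesseractRect() (Recognize is called implicitly if needed). RecognizeForChopTest is a variant of Recognize used for testing the chopper.

13: ProcessPages function declaration: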

/**
   * Turns images into symbolic text.
   *
   * filename can point to a single image, a multi-page TIFF,
   * or a plain text list of image filenames.
   *
   * retry_config is useful for debugging. If not nullptr, you can fall
   * back to an alternate configuration if a page fails for some
   * reason.
   *
   * timeout_millisec terminates processing if any single page
   * takes too long. Set to 0 for unlimited time.
   *
   * renderer is responsible for creating the output. For example,
   * use the TessTextRenderer if you want plaintext output, or
   * the TessPDFRenderer to produce searchable PDF.
   *
   * If tessedit_page_number is non-negative, will only process that
   * single page. Works for multi-page tiff file, or filelist.
   *
   * Returns true if successful, false on error.
   */
  bool ProcessPages(const char* filename, const char* retry_config,
                    int timeout_millisec, TessResultRenderer* renderer);
  // Does the real work of ProcessPages.
  bool ProcessPagesInternal(const char* filename, const char* retry_config,
                            int timeout_millisec, TessResultRenderer* renderer);
 
bool tesseract::TessBaseAPI::ProcessPages(const char* filename,
                                          const char* retry_config,
                                          int timeout_millisec,
                                          STRING* text_out)

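Recognizes all pages in the given file (a multi-page TIFF, a list of filenames, or a single image) and produces the appropriate text according to the parameters tessedit_create_boxfile, tessedit_make_boxes_from_boxes, tessedit_write_unlv and tessedit_create_hocr. ProcessPage is run on each page of the input, which may be a multi-page TIFF, a single page in another file format, or a plain text list of images to read. In the form of the signature that takes text_out, the result is placed in text_out. If tessedit_page_number is non-negative, this description has processing begin at that page, whereas the header comment above says that only that single page is processed. Returns false on error. If timeout_millisec is greater than 0, processing is terminated when a single page takes longer than that; if a page fails for some reason, it is reprocessed with the retry_config configuration file.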
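The "plain text list of image filenames" input and the tessedit_page_number rule can be pictured with a short stand-alone sketch. How ProcessPages actually parses the list is not shown in this article, so the one-name-per-line format below is an assumption, and `ParseFileList` and `SelectPages` are hypothetical helpers that follow the header comment's "only that single page" wording.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Splits a plain text list into filenames, assuming one name per line.
// Blank lines are skipped and a trailing '\r' (Windows line ends) is removed.
inline std::vector<std::string> ParseFileList(const std::string& text) {
  std::vector<std::string> names;
  std::istringstream in(text);
  std::string line;
  while (std::getline(in, line)) {
    if (!line.empty() && line.back() == '\r') line.pop_back();
    if (!line.empty()) names.push_back(line);
  }
  return names;
}

// Mirrors the header comment: a non-negative page number selects that single
// page (0-based here, by assumption); a negative value selects every page.
inline std::vector<std::string> SelectPages(
    const std::vector<std::string>& names, int page_number) {
  if (page_number < 0) return names;
  assert(static_cast<std::size_t>(page_number) < names.size());
  return {names[static_cast<std::size_t>(page_number)]};
}
```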

14: ProcessPage function declaration:

/**
   * Turn a single image into symbolic text.
   *
   * The pix is the image processed. filename and page_index are
   * metadata used by side-effect processes, such as reading a box
   * file or formatting as hOCR.
   *
   * See ProcessPages for descriptions of other parameters.
   */
  bool ProcessPage(Pix* pix, int page_index, const char* filename,
                   const char* retry_config, int timeout_millisec,
                   TessResultRenderer* renderer);
bool tesseract::TessBaseAPI::ProcessPage(Pix* pix,
                                         int page_index,
                                         const char* filename,
                                         const char* retry_config,
                                         int timeout_millisec,
                                         STRING* text_out)

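Recognizes a single page on behalf of ProcessPages. In the form of the signature that takes text_out, the text is placed in text_out. pix is the image to process; filename and page_index are metadata used by side-effect processes, such as reading a box file or formatting the output as hOCR.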

15: GetIterator function declaration:

ResultIterator * tesseract::TessBaseAPI::GetIterator()

Gets a reading-order iterator over the results of LayoutAnalysis and/or Recognize. Delete it after use.

16: GetMutableIterator function declaration:

 

/**
   * Get a reading-order iterator to the results of LayoutAnalysis and/or
   * Recognize. The returned iterator must be deleted after use.
   * WARNING! This class points to data held within the TessBaseAPI class, and
   * therefore can only be used while the TessBaseAPI class still exists and
   * has not been subjected to a call of Init, SetImage, Recognize, Clear, End,
   * DetectOS, or anything else that changes the internal PAGE_RES.
   */
  ResultIterator* GetIterator();
MutableIterator * tesseract::TessBaseAPI::GetMutableIterator()

Gets a mutable iterator over the results of LayoutAnalysis and/or Recognize. Delete it after use.

17: GetUTF8Text function declaration:

/**
   * The recognized text is returned as a char* which is coded
   * as UTF8 and must be freed with the delete [] operator.
   */
  char* GetUTF8Text();
char * tesseract::TessBaseAPI::GetUTF8Text()

The recognized text, taken from the internal data structures, is returned as a char* encoded as UTF-8. It must be freed with the delete [] operator.
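Because the returned buffer has to be released with delete [], one way to avoid leaks is to hand it to a std::unique_ptr<char[]> straight away. The sketch below shows that pattern; `FakeGetUTF8Text` is a hypothetical stand-in that only imitates the allocation contract of GetUTF8Text, and `TakeText` is a helper invented for this example.

```cpp
#include <cassert>
#include <cstring>
#include <memory>
#include <string>

// Hypothetical stand-in for TessBaseAPI::GetUTF8Text(): returns a
// NUL-terminated buffer allocated with new[], as the real call is documented
// to do.
inline char* FakeGetUTF8Text() {
  const char text[] = "hello\n";
  char* out = new char[sizeof(text)];
  std::memcpy(out, text, sizeof(text));
  return out;
}

// Takes ownership of a new[]-allocated C string, copies it into a
// std::string, and frees the original buffer with delete [].
inline std::string TakeText(char* raw) {
  std::unique_ptr<char[]> owner(raw);  // delete [] runs when owner is destroyed
  return owner ? std::string(owner.get()) : std::string();
}

// With a real TessBaseAPI instance `api` the call would be:
//   std::string text = TakeText(api.GetUTF8Text());
```

The array form `std::unique_ptr<char[]>` matters here: the plain `std::unique_ptr<char>` would call delete instead of delete [].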

18: GetHOCRText function declaration:

/**
   * Make a HTML-formatted string with hOCR markup from the internal
   * data structures.
   * page_number is 0-based but will appear in the output as 1-based.
   * monitor can be used to
   *  cancel the recognition
   *  receive progress callbacks
   * Returned string must be freed with the delete [] operator.
   */
  char* GetHOCRText(ETEXT_DESC* monitor, int page_number);
 
char * tesseract::TessBaseAPI::GetHOCRText(int page_number)