Tesseract API interfaces

Reposted

bugouhen 2024-05-13 19:37:26


Tesseract currently recognizes more than 100 languages, and it can be trained to recognize others.

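The source package provides an OCR engine, libtesseract, and a command-line program, tesseract. The main recognition pipeline consists of binarization, segmentation, recognition, and error correction. Broadly, the engine has two parts: page layout analysis, and character segmentation and recognition. The second part is the design focus of Tesseract. Character segmentation itself breaks down into four steps: analyze connected components; find block regions; find text lines and words; and produce (recognize) the text. The API that Tesseract exposes is declared in baseapi.h. This article introduces the commonly used interfaces in baseapi.h and shows how they combine into a simple recognition call. Later articles will look at the source of each API in turn to understand the algorithms and how they are implemented.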

(Based on the Tesseract 4.0 documentation.)

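tesseract::TessBaseAPI: the base interface class. It covers initialization, simple processing of the text in an image, the result structures of layout analysis, and so on.
IMAGE: a plain class that wraps image operations, including reading an image and retrieving its parameters.
Others: data type declarations, related struct declarations, cross-platform handling, command-line argument parsing, and so on.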

In practice, what we use comes from the first two.

Descriptions and usage of most Tesseract interfaces

1. void SetImage(const unsigned char* imagedata, int width, int height,
                int bytes_per_pixel, int bytes_per_line);
 
  /**
   * Provide an image for Tesseract to recognize. Format is as
   * TesseractRect above. Copies the image buffer and converts to Pix.
   * SetImage clears all recognition results, and sets the rectangle to the
   * full image, so it may be followed immediately by a GetUTF8Text, and it
   * will automatically perform recognition.
   */

Provides Tesseract with the image to recognize.
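The raw-buffer overload takes both bytes_per_pixel and bytes_per_line, and a mismatch between those values and the actual buffer is an easy mistake to make. The sketch below computes the row stride for a tightly packed buffer. It assumes the convention from the TesseractRect comment in baseapi.h, which this article does not reproduce: 8-bit grey, 24-bit or 32-bit color, and bytes_per_pixel = 0 for byte-packed 1-bit images. PackedBytesPerLine and RequiredBufferSize are hypothetical helper names, not Tesseract functions.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical helper (not part of Tesseract): row stride in bytes for a
// tightly packed buffer, i.e. one with no padding at the end of each row.
// Assumed bytes_per_pixel convention: 1 = 8-bit grey, 3 = 24-bit color,
// 4 = 32-bit color, 0 = 1-bit binary packed 8 pixels to a byte.
inline int PackedBytesPerLine(int width, int bytes_per_pixel) {
  if (width <= 0) throw std::invalid_argument("width must be positive");
  switch (bytes_per_pixel) {
    case 0:
      return (width + 7) / 8;  // round up to whole bytes
    case 1:
    case 3:
    case 4:
      return width * bytes_per_pixel;
    default:
      throw std::invalid_argument("unsupported bytes_per_pixel");
  }
}

// Hypothetical helper: minimum buffer size in bytes for `height` rows.
inline std::size_t RequiredBufferSize(int height, int bytes_per_line) {
  if (height <= 0 || bytes_per_line <= 0) {
    throw std::invalid_argument("height and bytes_per_line must be positive");
  }
  return static_cast<std::size_t>(height) *
         static_cast<std::size_t>(bytes_per_line);
}
```

With an initialized engine, the result would be passed as the last argument of SetImage(imagedata, width, height, bytes_per_pixel, bytes_per_line). Buffers whose rows are padded (for example to a 4-byte boundary) need their real stride instead.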

void SetImage(Pix* pix);
 
/**
   * Provide an image for Tesseract to recognize. As with SetImage above,
   * Tesseract takes its own copy of the image, so it need not persist until
   * after Recognize.
   * Pix vs raw, which to use?
   * Use Pix where possible. Tesseract uses Pix as its internal representation
   * and it is therefore more efficient to provide a Pix directly.
   */
  
2. void SetRectangle(int left, int top, int width, int height);
/**
   * Restrict recognition to a sub-rectangle of the image. Call after SetImage.
   * Each SetRectangle clears the recognition results so multiple rectangles
   * can be recognized with the same image.
   */

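Restricts recognition to a sub-rectangle of the image. Call it after SetImage. Each call clears the recognition results, so several rectangles of the same image can be recognized one after another.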
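The header comment does not say what happens when the rectangle extends past the image, so one defensive option is to clamp the region before passing it on. The following is a minimal sketch under that assumption; Rect and ClampToImage are hypothetical names introduced for illustration.

```cpp
#include <algorithm>

// Hypothetical plain struct mirroring the four SetRectangle arguments.
struct Rect {
  int left;
  int top;
  int width;
  int height;
};

// Hypothetical helper: intersects a requested region with the image bounds.
// A region that lies entirely outside the image comes back with zero width
// and/or zero height, which the caller should treat as "nothing to recognize".
inline Rect ClampToImage(Rect r, int image_width, int image_height) {
  const int x0 = std::max(0, std::min(r.left, image_width));
  const int y0 = std::max(0, std::min(r.top, image_height));
  const int x1 = std::max(0, std::min(r.left + r.width, image_width));
  const int y1 = std::max(0, std::min(r.top + r.height, image_height));
  return Rect{x0, y0, std::max(0, x1 - x0), std::max(0, y1 - y0)};
}
```

A caller would then skip empty results and pass the remaining fields to SetRectangle(left, top, width, height), once per region of the same image.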

3. void SetSourceResolution(int ppi);

Sets the resolution of the source image in pixels per inch, so that font size information can be computed in the results. Call it after SetImage.

/**
   * Set the resolution of the source image in pixels per inch so font size
   * information can be calculated in results.  Call this after SetImage().
   */
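To see why the resolution matters for font size, recall that a typographic point is 1/72 of an inch, so a height measured in pixels only becomes a point size once the pixels-per-inch value is known. The sketch below shows that arithmetic. It is an illustration of the unit conversion, not a claim about how Tesseract computes font sizes internally, and PixelsToPoints is a hypothetical name.

```cpp
#include <stdexcept>

// Hypothetical helper: converts a length in pixels to typographic points
// (1 pt = 1/72 inch) for an image scanned at `ppi` pixels per inch.
inline double PixelsToPoints(double pixels, int ppi) {
  if (ppi <= 0) throw std::invalid_argument("ppi must be positive");
  return pixels * 72.0 / ppi;
}
```

For example, text that is 50 pixels tall in a 300 ppi scan corresponds to 12 points, while the same 50 pixels at 150 ppi would be 24 points. This is why a wrong or missing resolution skews any reported font size.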
4. /**
   * In extreme cases only, usually with a subclass of Thresholder, it
   * is possible to provide a different Thresholder. The Thresholder may
   * be preloaded with an image, settings etc, or they may be set after.
   * Note that Tesseract takes ownership of the Thresholder and will
   * delete it when it is replaced or the API is destructed.
   */
  void SetThresholder(ImageThresholder* thresholder) {
    delete thresholder_;
    thresholder_ = thresholder;
    ClearResults();
  }

/**
   * Get a copy of the internal thresholded image from Tesseract.
   * Caller takes ownership of the Pix and must pixDestroy it.
   * May be called any time after SetImage, or after TesseractRect.
   */
  Pix* GetThresholdedImage();

/**
   * Get the result of page layout analysis as a leptonica-style
   * Boxa, Pixa pair, in reading order.
   * Can be called before or after Recognize.
   */
  Boxa* GetRegions(Pixa** pixa);

Returns the result of page layout analysis as a leptonica-style Boxa, Pixa pair, in reading order. Can be called before or after Recognize.

/**
   * Get the textlines as a leptonica-style
   * Boxa, Pixa pair, in reading order.
   * Can be called before or after Recognize.
   * If raw_image is true, then extract from the original image instead of the
   * thresholded image and pad by raw_padding pixels.
   * If blockids is not nullptr, the block-id of each line is also returned as an
   * array of one element per line. delete [] after use.
   * If paraids is not nullptr, the paragraph-id of each line within its block is
   * also returned as an array of one element per line. delete [] after use.
   */
  Boxa* GetTextlines(const bool raw_image, const int raw_padding,
                     Pixa** pixa, int** blockids, int** paraids);

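Returns the text lines as a leptonica-style Boxa, Pixa pair, in reading order. Can be called before or after Recognize. If blockids is not nullptr, the block-id of each line is also returned as an array with one element per line; the caller must free it with delete [] after use.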
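Because the blockids array is allocated by the library and must be released with delete [], it is convenient to hand it to a smart pointer as soon as it comes back. The sketch below takes ownership of such an array and groups line indices by block. GroupLinesByBlock is a hypothetical helper; the element count is assumed to be supplied by the caller (for instance from the number of boxes in the returned Boxa).

```cpp
#include <map>
#include <memory>
#include <vector>

// Hypothetical helper: takes ownership of a new[]-allocated array of
// per-line block ids and returns, for each block id, the indices of the
// lines that belong to it. The array is freed with delete[] on return.
inline std::map<int, std::vector<int>> GroupLinesByBlock(int* blockids,
                                                         int count) {
  std::unique_ptr<int[]> owner(blockids);  // releases with delete[]
  std::map<int, std::vector<int>> lines_by_block;
  if (!owner) return lines_by_block;  // nothing was returned
  for (int i = 0; i < count; ++i) {
    lines_by_block[owner[i]].push_back(i);
  }
  return lines_by_block;
}
```

The same pattern applies to the paraids array. The Boxa and Pixa results are leptonica objects and are released with boxaDestroy() and pixaDestroy() rather than delete [].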

/*
     Helper method to extract from the thresholded image. (most common usage)
  */
  Boxa* GetTextlines(Pixa** pixa, int** blockids) {
    return GetTextlines(false, 0, pixa, blockids, nullptr);
  }
 
  /**
   * Get textlines and strips of image regions as a leptonica-style Boxa, Pixa
   * pair, in reading order. Enables downstream handling of non-rectangular
   * regions.
   * Can be called before or after Recognize.
   * If blockids is not nullptr, the block-id of each line is also returned as an
   * array of one element per line. delete [] after use.
   */
  Boxa* GetStrips(Pixa** pixa, int** blockids);

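Returns text lines and strips of image regions as a leptonica-style Boxa, Pixa pair, in reading order, which allows downstream handling of non-rectangular regions. Can be called before or after Recognize.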

/**
   * Get the words as a leptonica-style
   * Boxa, Pixa pair, in reading order.
   * Can be called before or after Recognize.
   */
  Boxa* GetWords(Pixa** pixa);

Returns the words as a leptonica-style Boxa, Pixa pair, in reading order. Can be called before or after Recognize.

/**
   * Gets the individual connected (text) components (created
   * after page segmentation step, but before recognition)
   * as a leptonica-style Boxa, Pixa pair, in reading order.
   * Can be called before or after Recognize.
   * Note: the caller is responsible for calling boxaDestroy()
   * on the returned Boxa array and pixaDestroy() on cc array.
   */
  Boxa* GetConnectedComponents(Pixa** cc);

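Returns the individual connected (text) components, which are created after page segmentation but before recognition, as a leptonica-style Boxa, Pixa pair in reading order. Can be called before or after Recognize. The caller is responsible for calling boxaDestroy() on the returned Boxa and pixaDestroy() on the cc array.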

/**
   * Get the given level kind of components (block, textline, word etc.) as a
   * leptonica-style Boxa, Pixa pair, in reading order.
   * Can be called before or after Recognize.
   * If blockids is not nullptr, the block-id of each component is also returned
   * as an array of one element per component. delete [] after use.
   * If paraids is not nullptr, the paragraph-id of each component with its block
   * is also returned as an array of one element per component. delete [] after
   * use.
   * If raw_image is true, then portions of the original image are extracted
   * instead of the thresholded image and padded with raw_padding.
   * If text_only is true, then only text components are returned.
   */
  Boxa* GetComponentImages(const PageIteratorLevel level,
                           const bool text_only, const bool raw_image,
                           const int raw_padding,
                           Pixa** pixa, int** blockids, int** paraids);

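Returns the components of the given level (block, textline, word, etc.) as a leptonica-style Boxa, Pixa pair, in reading order. Can be called before or after Recognize. If blockids is not nullptr, the block-id of each component is also returned as an array with one element per component; free it with delete [] after use. If text_only is true, only text components are returned.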

// Helper function to get binary images with no padding (most common usage).
  Boxa* GetComponentImages(const PageIteratorLevel level,
                           const bool text_only,
                           Pixa** pixa, int** blockids) {
    return GetComponentImages(level, text_only, false, 0, pixa, blockids, nullptr);
  }

9: DumpPGM function declaration:

void tesseract::TessBaseAPI::DumpPGM(const char* filename)

Writes the internal binary image to a PGM file.

10: AnalyseLayout function declaration:

/**
   * Runs page layout analysis in the mode set by SetPageSegMode.
   * May optionally be called prior to Recognize to get access to just
   * the page layout results. Returns an iterator to the results.
   * If merge_similar_words is true, words are combined where suitable for use
   * with a line recognizer. Use if you want to use AnalyseLayout to find the
   * textlines, and then want to process textline fragments with an external
   * line recognizer.
   * Returns nullptr on error or an empty page.
   * The returned iterator must be deleted after use.
   * WARNING! This class points to data held within the TessBaseAPI class, and
   * therefore can only be used while the TessBaseAPI class still exists and
   * has not been subjected to a call of Init, SetImage, Recognize, Clear, End,
   * DetectOS, or anything else that changes the internal PAGE_RES.
   */
  PageIterator* AnalyseLayout();
  PageIterator* AnalyseLayout(bool merge_similar_words);

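Runs page layout analysis in the mode set by SetPageSegMode and returns an iterator over the results, or nullptr on error or for an empty page. The iterator must be deleted after use. Note: the iterator points to data held inside the TessBaseAPI object, so it can only be used while that object still exists and has not been subjected to a call of Init, SetImage, Recognize, Clear, End, DetectOS, or anything else that changes the internal PAGE_RES.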

11: Recognize function declaration:

/**
   * Recognize the image from SetAndThresholdImage, generating Tesseract
   * internal structures. Returns 0 on success.
   * Optional. The Get*Text functions below will call Recognize if needed.
   * After Recognize, the output is kept internally until the next SetImage.
   */
  int Recognize(ETEXT_DESC* monitor);
int tesseract::TessBaseAPI::Recognize(ETEXT_DESC * monitor)

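Recognizes the image from SetAndThresholdImage and generates Tesseract's internal structures. Returns 0 on success. The call is optional: the Get*Text functions below call Recognize if needed. After Recognize, the output is kept internally until the next SetImage.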

12: RecognizeForChopTest function declaration:

/**
   * Methods to retrieve information after SetAndThresholdImage(),
   * Recognize() or TesseractRect(). (Recognize is called implicitly if needed.)
   */
 
  /** Variant on Recognize used for testing chopper. */
  int RecognizeForChopTest(ETEXT_DESC* monitor);
int tesseract::TessBaseAPI::RecognizeForChopTest(ETEXT_DESC * monitor)

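The methods in this group retrieve information after SetAndThresholdImage(), Recognize(), or TesseractRect(); Recognize is called implicitly if needed. RecognizeForChopTest is a variant of Recognize used for testing the chopper.

13: ProcessPages function declaration: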

/**
   * Turns images into symbolic text.
   *
   * filename can point to a single image, a multi-page TIFF,
   * or a plain text list of image filenames.
   *
   * retry_config is useful for debugging. If not nullptr, you can fall
   * back to an alternate configuration if a page fails for some
   * reason.
   *
   * timeout_millisec terminates processing if any single page
   * takes too long. Set to 0 for unlimited time.
   *
   * renderer is responsible for creating the output. For example,
   * use the TessTextRenderer if you want plaintext output, or
   * the TessPDFRenderer to produce searchable PDF.
   *
   * If tessedit_page_number is non-negative, will only process that
   * single page. Works for multi-page tiff file, or filelist.
   *
   * Returns true if successful, false on error.
   */
  bool ProcessPages(const char* filename, const char* retry_config,
                    int timeout_millisec, TessResultRenderer* renderer);
  // Does the real work of ProcessPages.
  bool ProcessPagesInternal(const char* filename, const char* retry_config,
                            int timeout_millisec, TessResultRenderer* renderer);
 
bool tesseract::TessBaseAPI::ProcessPages(const char* filename, const char* retry_config, int timeout_millisec, STRING* text_out)

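The description here follows the signature with a STRING* text_out parameter rather than the renderer-based declaration. It recognizes all pages of the given file, which may be a multi-page TIFF, a single image in another format, or a plain text list of image filenames, and produces the appropriate text according to the parameters tessedit_create_boxfile, tessedit_make_boxes_from_boxes, tessedit_write_unlv, and tessedit_create_hocr. ProcessPage is run on each page of the input, and the result is placed in text_out. If tessedit_page_number is non-negative, processing begins at that page (the 4.0 header comment instead says that only that single page is processed). Returns false on error. If timeout_millisec is non-zero, processing is terminated when a single page takes longer than that. If a page fails for some reason, it is reprocessed with the retry_config configuration file.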

14: ProcessPage function declaration:

/**
   * Turn a single image into symbolic text.
   *
   * The pix is the image processed. filename and page_index are
   * metadata used by side-effect processes, such as reading a box
   * file or formatting as hOCR.
   *
   * See ProcessPages for descriptions of other parameters.
   */
  bool ProcessPage(Pix* pix, int page_index, const char* filename,
                   const char* retry_config, int timeout_millisec,
                   TessResultRenderer* renderer);
bool tesseract::TessBaseAPI::ProcessPage(Pix* pix, int page_index, const char* filename, const char* retry_config, int timeout_millisec, STRING* text_out)

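Recognizes a single page on behalf of ProcessPages; the text is placed in text_out. pix is the image to process. filename and page_index are metadata used by side-effect processes, such as reading a box file or formatting as hOCR.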

15: GetIterator function declaration:

/**
   * Get a reading-order iterator to the results of LayoutAnalysis and/or
   * Recognize. The returned iterator must be deleted after use.
   * WARNING! This class points to data held within the TessBaseAPI class, and
   * therefore can only be used while the TessBaseAPI class still exists and
   * has not been subjected to a call of Init, SetImage, Recognize, Clear, End,
   * DetectOS, or anything else that changes the internal PAGE_RES.
   */
  ResultIterator* GetIterator();
ResultIterator * tesseract::TessBaseAPI::GetIterator()

Returns a reading-order iterator over the results of LayoutAnalysis and/or Recognize. The iterator must be deleted after use.

16: GetMutableIterator function declaration:

MutableIterator * tesseract::TessBaseAPI::GetMutableIterator()

Returns a mutable iterator over the results of LayoutAnalysis and/or Recognize. The iterator must be deleted after use.

17: GetUTF8Text function declaration:

/**
   * The recognized text is returned as a char* which is coded
   * as UTF8 and must be freed with the delete [] operator.
   */
  char* GetUTF8Text();
char * tesseract::TessBaseAPI::GetUTF8Text()

The recognized text is built from the internal data structures and returned as a char* encoded as UTF-8. The caller must free it with the delete [] operator.
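Since the returned buffer has to be released with delete [] and not with free() or plain delete, one low-risk pattern is to copy it into a std::string immediately and let a smart pointer do the release. The following is a minimal sketch of that idea; TakeUtf8 is a hypothetical helper name, and it works on any new[]-allocated, NUL-terminated buffer.

```cpp
#include <memory>
#include <string>

// Hypothetical helper: copies a new[]-allocated, NUL-terminated UTF-8 buffer
// into a std::string and frees the original with delete[]. A null pointer
// yields an empty string.
inline std::string TakeUtf8(char* raw) {
  std::unique_ptr<char[]> owner(raw);  // releases with delete[]
  return owner ? std::string(owner.get()) : std::string();
}
```

With an initialized engine and an image already set, a call would look like TakeUtf8(api.GetUTF8Text()). The same wrapper fits GetHOCRText, whose result is documented as freed the same way.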

18: GetHOCRText function declaration:

/**
   * Make a HTML-formatted string with hOCR markup from the internal
   * data structures.
   * page_number is 0-based but will appear in the output as 1-based.
   * monitor can be used to
   *  cancel the recognition
   *  receive progress callbacks
   * Returned string must be freed with the delete [] operator.
   */
  char* GetHOCRText(ETEXT_DESC* monitor, int page_number);
 
char * tesseract::TessBaseAPI::GetHOCRText(int page_number)
