Access confidence in python-tesseract
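I'm trying to build an OCR extension for python-tesseract that specifically handles tables of data with internal structure (e.g. tables with subtotals and totals for rows and columns, so the user can improve accuracy by enforcing that structure).
I'm trying to access the confidence that tesseract assigns to multiple results (e.g. all results from an unconstrained run, and all results from a run where the character set is restricted to [0-9\.]).
I've seen some information about accessing the x_wconf attribute from the GetHOCRText API method, but I haven't been able to figure out how to get at it from the Python API. How do you call/access this value? Thanks!
I'm using python-tesseract 0.9.1 on OSX 10.10.3 with Python 2.7.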
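EDIT
Actually, I was all wrong: I was thinking of pytesseract, not python-tesseract.
If you look at the API source (baseapi_mini.h) you'll see that there are some functions that sound very promising for what you're trying to do. The section you're interested in starts at around line 500.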
char* GetUTF8Text();
/**
* Make a HTML-formatted string with hOCR markup from the internal
* data structures.
* page_number is 0-based but will appear in the output as 1-based.
*/
char* GetHOCRText(int page_number);
/**
* The recognized text is returned as a char* which is coded in the same
* format as a box file used in training. Returned string must be freed with
* the delete [] operator.
* Constructs coordinates in the original image - not just the rectangle.
* page_number is a 0-based page index that will appear in the box file.
*/
char* GetBoxText(int page_number);
/**
* The recognized text is returned as a char* which is coded
* as UNLV format Latin-1 with specific reject and suspect codes
* and must be freed with the delete [] operator.
*/
char* GetUNLVText();
/** Returns the (average) confidence value between 0 and 100. */
int MeanTextConf();
/**
* Returns all word confidences (between 0 and 100) in an array, terminated
* by -1. The calling function must delete [] after use.
* The number of confidences should correspond to the number of space-
* delimited words in GetUTF8Text.
*/
int* AllWordConfidences();
/**
* Applies the given word to the adaptive classifier if possible.
* The word must be SPACE-DELIMITED UTF-8 - l i k e t h i s , so it can
* tell the boundaries of the graphemes.
* Assumes that SetImage/SetRectangle have been used to set the image
* to the given word. The mode arg should be PSM_SINGLE_WORD or
* PSM_CIRCLE_WORD, as that will be used to control layout analysis.
* The currently set PageSegMode is preserved.
* Returns false if adaption was not possible for some reason.
*/
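As a rough illustration of how the x_wconf values mentioned in the question could be read once you have the string returned by GetHOCRText, here is a minimal sketch in Python 3 using only the standard library. The sample markup is a hand-written assumption modelled on hOCR word spans (class ocrx_word, confidence in the title attribute), not real tesseract output, and the class name WordConfidenceParser is hypothetical.

```python
import re
from html.parser import HTMLParser


class WordConfidenceParser(HTMLParser):
    """Collect (word, confidence) pairs from hOCR word spans."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._depth = 0      # span nesting depth inside the current word
        self._conf = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._depth:
            # A span nested inside a word span: just track the depth.
            self._depth += 1
            return
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            match = re.search(r"x_wconf\s+(\d+)", attrs.get("title") or "")
            self._conf = int(match.group(1)) if match else None
            self._text = []
            self._depth = 1

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.words.append(("".join(self._text).strip(), self._conf))

    def handle_data(self, data):
        if self._depth:
            self._text.append(data)


def word_confidences_from_hocr(hocr):
    parser = WordConfidenceParser()
    parser.feed(hocr)
    return parser.words


# Hand-written sample in the shape of hOCR output (an assumption).
SAMPLE_HOCR = """
<span class='ocr_line' title='bbox 36 92 580 122'>
<span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 91'>Total</span>
<span class='ocrx_word' id='word_1_2' title='bbox 109 92 180 116; x_wconf 74'><strong>1234.50</strong></span>
</span>
"""

print(word_confidences_from_hocr(SAMPLE_HOCR))
```

Running the same parser over the hOCR from an unconstrained run and from a digits-only run would give two lists of per-word confidences to compare cell by cell.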
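My original answer
To do this you will have to write your own wrapper.
python-tesseract is nice because it gets you up and running quickly, but it isn't what I would call sophisticated. You can read the source and see how it works, but here is the synopsis:
1. Write the input image to a temp file
2. Call the tesseract command (from the command line) on that file
3. Return the results
So if you want to do anything special, this isn't going to work at all.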
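The three steps above can be sketched roughly as follows. This is not the wrapper's actual source, only an illustration of the temp-file-plus-subprocess pattern; the function names build_command and ocr_image_bytes are hypothetical, and the sketch assumes a tesseract executable on the PATH that accepts the usual `tesseract imagename outputbase [configfile]` invocation.

```python
import glob
import os
import subprocess
import tempfile


def build_command(image_path, output_base, configs=()):
    """Build the argument list for the tesseract command-line tool."""
    return ["tesseract", image_path, output_base] + list(configs)


def ocr_image_bytes(image_bytes, suffix=".png", configs=()):
    """Write the image to a temp file, run tesseract on it, return the output."""
    with tempfile.TemporaryDirectory() as workdir:
        image_path = os.path.join(workdir, "input" + suffix)
        output_base = os.path.join(workdir, "output")

        # Step 1: write the input image to a temp file.
        with open(image_path, "wb") as handle:
            handle.write(image_bytes)

        # Step 2: call the tesseract command on that file.
        subprocess.check_call(build_command(image_path, output_base, configs))

        # Step 3: return the results. The output extension depends on the
        # config used, so look for whatever file tesseract produced.
        outputs = glob.glob(output_base + ".*")
        if not outputs:
            raise RuntimeError("tesseract produced no output file")
        with open(outputs[0], "rb") as handle:
            return handle.read().decode("utf-8")
```

Passing `configs=["hocr"]` would ask the command-line tool for hOCR output instead of plain text, but every call still pays for a disk write and a fresh tesseract start-up.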
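I had an application that needed high performance, and the time spent waiting for a file to be written to disk, then waiting for tesseract to start up, load the image and process it, was just too much.
If I remember correctly (I no longer have access to the source) I used ctypes to load a tesseract process, set the image data and then call the GetHOCRText method. Then, when I needed to process another image, I didn't have to wait for tesseract to load again; I just set the image data and called GetHOCRText again.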
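A minimal sketch of that approach, under some loud assumptions: that a libtesseract shared library is installed, and that it exports the C API functions named below (TessBaseAPICreate, TessBaseAPIInit3, TessBaseAPISetImage, TessBaseAPIAllWordConfidences, TessDeleteIntArray, TessBaseAPIDelete). Check these against the capi.h of your installed version. It uses AllWordConfidences from the header excerpt above rather than GetHOCRText, because that returns the per-word confidences directly as a -1-terminated array. The helper names are hypothetical.

```python
import ctypes
import ctypes.util


def read_confidences(ptr):
    """Copy a -1-terminated C int array into a Python list."""
    values = []
    if not ptr:  # NULL pointer, e.g. recognition failed
        return values
    index = 0
    while ptr[index] != -1:
        values.append(ptr[index])
        index += 1
    return values


def load_tesseract():
    """Load libtesseract and declare the C API functions used below."""
    path = ctypes.util.find_library("tesseract")
    if path is None:
        raise OSError("libtesseract not found")
    lib = ctypes.CDLL(path)
    int_ptr = ctypes.POINTER(ctypes.c_int)
    lib.TessBaseAPICreate.argtypes = []
    lib.TessBaseAPICreate.restype = ctypes.c_void_p
    lib.TessBaseAPIDelete.argtypes = [ctypes.c_void_p]
    lib.TessBaseAPIDelete.restype = None
    lib.TessBaseAPIInit3.argtypes = [ctypes.c_void_p, ctypes.c_char_p,
                                     ctypes.c_char_p]
    lib.TessBaseAPIInit3.restype = ctypes.c_int
    lib.TessBaseAPISetImage.argtypes = [ctypes.c_void_p, ctypes.c_char_p,
                                        ctypes.c_int, ctypes.c_int,
                                        ctypes.c_int, ctypes.c_int]
    lib.TessBaseAPISetImage.restype = None
    lib.TessBaseAPIAllWordConfidences.argtypes = [ctypes.c_void_p]
    lib.TessBaseAPIAllWordConfidences.restype = int_ptr
    lib.TessDeleteIntArray.argtypes = [int_ptr]
    lib.TessDeleteIntArray.restype = None
    return lib


def create_handle(lib, language=b"eng"):
    """Create and initialise one engine handle; reuse it for every image."""
    handle = lib.TessBaseAPICreate()
    if lib.TessBaseAPIInit3(handle, None, language) != 0:
        lib.TessBaseAPIDelete(handle)
        raise RuntimeError("could not initialise tesseract")
    return handle


def word_confidences(lib, handle, pixels, width, height, bytes_per_pixel):
    """Set raw image data on an existing handle and return word confidences."""
    lib.TessBaseAPISetImage(handle, pixels, width, height,
                            bytes_per_pixel, width * bytes_per_pixel)
    ptr = lib.TessBaseAPIAllWordConfidences(handle)
    try:
        return read_confidences(ptr)
    finally:
        if ptr:
            lib.TessDeleteIntArray(ptr)  # the caller owns the array
```

The point of the design is that load_tesseract and create_handle run once, while word_confidences can be called for each new image without restarting the engine.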
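So this isn't an exact solution to your problem, and it's definitely not a code snippet you can drop in. But hopefully it will help you make some progress towards your goal.
Here is another question about wrapping external libraries: Wrapping a C library in Python: C, Cython or ctypes?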