How to set the FSM configuration for the Textricator PDF OCR reader?
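I am trying to use a PDF document parser called Textricator. It can parse PDFs using 3 different methods, built on some common OCR libraries (itext5, itext7, pdfbox). The available methods are: text, table and form. text is for ordinary raw OCR text extraction, table is for reading structured tabular data, and form uses a finite state machine (FSM) to parse less structured forms.
However, I cannot get the form parser to work. Perhaps I simply do not understand how to organise the many configuration states. The documentation lacks a simple form example, and recently someone posted an attempt to read a very basic table using the form method, but was unable to make it work. I gave it a try as well, without success.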
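Q: Can someone help me configure the state machine in the YML file?
(This is for parsing the demo file from one of the issues in that repo, shown in the screenshot copied below.)
The YML config file: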
extractor: "pdf.pdfbox"
header:
  default: 100
footer:
  default: 600
maxRowDistance: 2
rootRecordType: item
recordTypes:
  item:
    label: "item"
    valueTypes:
      - item
      - date
      - description
      - order_number
      - quantity
      - price
valueTypes:
  item:
    label: "Item"
  date:
    label: "Date"
  description:
    label: "Description"
  order_number:
    label: "OrderNo"
  quantity:
    label: "Qty"
  price:
    label: "Price"
initialState: "INIT"
states:
  INIT:
    transitions:
      -
        condition: item
        nextState: item
  item:
    startRecord: true
    transitions:
      -
        condition: date
        nextState: date
  date:
    include: true
    transitions:
      -
        condition: description
        nextState: description
  description:
    include: true
    transitions:
      -
        condition: description
        nextState: description
      -
        condition: order_number
        nextState: order_number
      -
        condition: quantity
        nextState: quantity
  order_number:
    include: true
    transitions:
      -
        condition: order_number
        nextState: order_number
      -
        condition: quantity
        nextState: quantity
  quantity:
    include: true
    transitions:
      -
        condition: price
        nextState: price
  price:
    include: true
    transitions:
      -
        condition: end
        nextState: end
  end:
    include: false
    transitions:
      -
        condition: any
        nextState: end
conditions:
  item: '73 < ulx < 110 and text =~ /(\d)*/'
  date: '110 < ulx < 181 and text =~ /([0-9\-]*)/'
  description: '193 < ulx < 366'
  # order_number: '12 <= uly_rel <= 16 and text =~ ^.+/((\d{6})\-)((\d{2}))/'
  order_number: '12 <= uly_rel <= 16 and text =~ ^.+((\d{6})\-)((\d{2}))'
  quantity: '393 < ulx < 459'
  price: '459 < ulx < 523'
  end: 'text =~ /(Footer)/'
  any: "1 = 1"
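One way to reason about a configuration like this is to treat the states section as a directed graph and ask which states can ever lead back to item, the state that starts a record. The Python sketch below hand-copies the transitions from the config above into a dict. The names TRANSITIONS and reachable are illustrative and not part of Textricator, and the sketch makes no claim about what Textricator does at runtime when no transition matches.

```python
from collections import deque

# Transitions hand-copied from the `states` section above:
# state -> list of possible next states.
TRANSITIONS = {
    "INIT": ["item"],
    "item": ["date"],
    "date": ["description"],
    "description": ["description", "order_number", "quantity"],
    "order_number": ["order_number", "quantity"],
    "quantity": ["price"],
    "price": ["end"],
    "end": ["end"],
}


def reachable(start):
    """Return every state reachable from `start` in one or more steps."""
    seen = set()
    queue = deque(TRANSITIONS[start])
    while queue:
        state = queue.popleft()
        if state not in seen:
            seen.add(state)
            queue.extend(TRANSITIONS[state])
    return seen


# States from which a record could still begin, i.e. `item` is reachable.
can_restart = sorted(s for s in TRANSITIONS if "item" in reachable(s))
print(can_restart)  # ['INIT']
```

Only INIT can reach item, so once the first row has been consumed the graph offers no path to a second record: after price the only way forward is end. INIT also has a single transition, so any text that precedes the first item number has nothing to match.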
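You may wonder why I insist on using the form processor for this simple example. The reason is that my real-life documents will have a more complicated structure of sub-items under the Description field. AFAIK this can only (?) be handled effectively by a state machine.
But perhaps this is not the right tool for the job? If so, what other options are there?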
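UPDATE: (2021-05-18)
The author of Textricator has now revised the libraries used and the documentation, and corrected several working examples and user issues. Thanks to user mweber I now have a perfectly working parser and no longer need awk to handle weird columns.
Since Textricator is something of a hidden gem for PDF parsing IMO, I am happy to see someone using it, and I posted a config that works with the sample document to the GitHub issue: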
extractor: "pdf.pdfbox"
header:
  default: 100
footer:
  default: 600
maxRowDistance: 2
rootRecordType: item
recordTypes:
  item:
    label: "item"
    valueTypes:
      - item
      - date
      - description
      - order_number
      - quantity
      - price
valueTypes:
  item:
    label: "Item"
  date:
    label: "Date"
  description:
    label: "Description"
  order_number:
    label: "OrderNo"
  quantity:
    label: "Qty"
  price:
    label: "Price"
initialState: "INIT"
states:
  INIT:
    include: false
    transitions:
      -
        condition: item
        nextState: item
      - condition: any
        nextState: INIT
  item:
    startRecord: true
    transitions:
      -
        condition: date
        nextState: date
  date:
    include: true
    transitions:
      -
        condition: description
        nextState: description
  description:
    include: true
    transitions:
      -
        condition: description
        nextState: description
      -
        condition: order_number
        nextState: order_number
      -
        condition: quantity
        nextState: quantity
      -
        condition: item
        nextState: item
  order_number:
    include: true
    transitions:
      -
        condition: order_number
        nextState: order_number
      -
        condition: quantity
        nextState: quantity
  quantity:
    include: true
    transitions:
      -
        condition: price
        nextState: price
  price:
    include: true
    transitions:
      -
        condition: end
        nextState: end
      -
        condition: description
        nextState: description
      -
        condition: item
        nextState: item
  end:
    include: false
    transitions:
      -
        condition: any
        nextState: end
conditions:
  item: '73 < ulx < 110 and text =~ /(\d)*/'
  date: '110 < ulx < 181 and text =~ /([0-9\-]*)/'
  description: '193 < ulx < 366'
  order_number: '12 <= uly_rel <= 16 and text =~ /^.+(([0-9]{6})\-)(([0-9]{2}))/'
  quantity: '393 < ulx < 459'
  price: '459 < ulx < 523'
  end: 'text =~ /(Footer)/'
  any: "1 = 1"
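To illustrate how a state machine with these transitions groups text into records, here is a small Python simulation. It is a simplified model under stated assumptions, not Textricator's implementation: transitions are tried in listed order and the first match wins, a state without an explicit include is treated as included, each included text is stored under the name of the state it lands in, and conditions are reduced to their ulx ranges (the uly_rel and regex parts are left out, so order_number never fires here). The (ulx, text) sample tokens are invented.

```python
# Column ranges copied from the `conditions` section (ulx part only).
RANGES = {
    "item": (73, 110),
    "date": (110, 181),
    "description": (193, 366),
    "quantity": (393, 459),
    "price": (459, 523),
}

# Ordered transition conditions copied from the `states` section above.
TRANSITIONS = {
    "INIT": ["item", "any"],
    "item": ["date"],
    "date": ["description"],
    "description": ["description", "order_number", "quantity", "item"],
    "order_number": ["order_number", "quantity"],
    "quantity": ["price"],
    "price": ["end", "description", "item"],
    "end": ["any"],
}
ANY_TARGET = {"INIT": "INIT", "end": "end"}  # where `any` leads per state
SKIPPED = {"INIT", "end"}                    # states with include: false


def matches(condition, ulx, text):
    """Simplified stand-in for the condition expressions."""
    if condition == "any":
        return True
    if condition == "end":
        return "Footer" in text
    if condition not in RANGES:  # order_number is not modelled in this sketch
        return False
    low, high = RANGES[condition]
    return low < ulx < high


def parse(tokens):
    """Walk the tokens through the state machine and group them into records."""
    records, state = [], "INIT"
    for ulx, text in tokens:
        for condition in TRANSITIONS[state]:
            if matches(condition, ulx, text):
                state = ANY_TARGET[state] if condition == "any" else condition
                break
        else:
            raise ValueError(f"no transition from {state!r} for {text!r}")
        if state == "item":  # startRecord: true
            records.append({})
        if state not in SKIPPED:
            records[-1].setdefault(state, []).append(text)
    return records


# Invented sample: a heading, two rows (the first with a two-line
# description), then a footer.
tokens = [
    (200, "Order overview"),
    (80, "1"), (120, "2021-05-01"), (200, "Widget"), (200, "blue, large"),
    (400, "2"), (470, "9.99"),
    (80, "2"), (120, "2021-05-02"), (200, "Gadget"), (400, "1"), (470, "4.50"),
    (80, "Footer"), (200, "page 1"),
]
records = parse(tokens)
print(len(records))  # 2
```

In this model the any fallback on INIT absorbs the heading, and the item transition out of price is what lets the second row open a new record. Without those two transitions, the same tokens would stall on the heading or after the first price.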