How to set the FSM configuration for Textricator PDF OCR reader?

Question

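I am trying to use a PDF document parser called Textricator. It can parse PDFs with three different methods, using one of several common PDF libraries (itext5, itext7, pdfbox). The available methods are text, table and form. text is for plain raw text recognition, table is for reading structured table data, and form uses a finite state machine (FSM) to parse less structured forms.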

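However, I cannot get the form parser to work. Perhaps I simply don't understand how to organize the many configuration states. The documentation lacks a simple form example, and recently someone posted an attempt to read a very basic table using the form method, but could not get it to work. I gave it a try as well, without success.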

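Q: Can anyone help me configure the state machine in the YML file?
(This is for parsing the demo file from one of the issues in that repo, shown in the screenshot copied below.)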

The YML configuration file:


extractor: "pdf.pdfbox"

header:
  default: 100
footer:
   default: 600

maxRowDistance: 2

rootRecordType: item
recordTypes:
  item:
    label: "item"
    valueTypes:
      - item
      - date
      - description
      - order_number
      - quantity
      - price

valueTypes:
  item:
    label: "Item"
  date:
    label: "Date"
  description:
    label: "Description"
  order_number:
    label: "OrderNo"
  quantity:
    label: "Qty"
  price:
    label: "Price"
 
initialState: "INIT"

states:
  INIT:
    transitions:
      -
        condition: item
        nextState: item

  item:
    startRecord: true
    transitions:
      -
        condition: date
        nextState: date  

  date:
    include: true
    transitions:
      -
        condition: description
        nextState: description  

  description:
    include: true
    transitions:
      -
        condition: description
        nextState: description     
      -
        condition: order_number
        nextState: order_number
      -
        condition: quantity
        nextState: quantity

  order_number:
    include: true
    transitions:
      -
        condition: order_number
        nextState: order_number
      -
        condition: quantity
        nextState: quantity

  quantity:
    include: true
    transitions:
      -
        condition: price
        nextState: price

  price:
    include: true
    transitions:
      -
        condition: end
        nextState: end

  end:
    include: false
    transitions:
      -
        condition: any
        nextState: end

conditions:

  item:         '73 < ulx < 110 and text =~ /(\d)*/'
  date:         '110 < ulx < 181 and text =~ /([0-9\-]*)/'
  description:  '193 < ulx < 366'
#  order_number: '12 <= uly_rel <= 16 and text =~ ^.+/((\d{6})\-)((\d{2}))/'
  order_number: '12 <= uly_rel <= 16 and text =~ ^.+((\d{6})\-)((\d{2}))'
  quantity:     '393 < ulx < 459'
  price:        '459 < ulx < 523'

  end:          'text =~ /(Footer)/'
  any: "1 = 1"

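You may wonder why I insist on using the form processor for this simple example. The reason is that my real-life documents will have a more complicated structure of sub-items under the Description field. AFAIK, that can only (?) be handled effectively by a state machine.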

But perhaps this is not the right tool for the job? If so, what other options are there?

Update: (2021-05-18)

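The author of Textricator has now updated the libraries used and the documentation, and has corrected several working examples and user issues. Thanks to user mweber, I now have a perfectly working parser and no longer need to use awk to handle weird columns.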

Answer 1

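Since Textricator is a hidden gem for PDF parsing, IMO, I was happy to see someone using it, and I posted a working configuration for your example document to the GitHub issue: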

extractor: "pdf.pdfbox"

header:
  default: 100
footer:
  default: 600

maxRowDistance: 2

rootRecordType: item
recordTypes:
  item:
    label: "item"
    valueTypes:
      - item
      - date
      - description
      - order_number
      - quantity
      - price

valueTypes:
  item:
    label: "Item"
  date:
    label: "Date"
  description:
    label: "Description"
  order_number:
    label: "OrderNo"
  quantity:
    label: "Qty"
  price:
    label: "Price"

initialState: "INIT"

states:
  INIT:
    include: false
    transitions:
      -
        condition: item
        nextState: item
      - condition: any
        nextState: INIT

  item:
    startRecord: true
    transitions:
      -
        condition: date
        nextState: date  

  date:
    include: true
    transitions:
      -
        condition: description
        nextState: description  

  description:
    include: true
    transitions:
      -
        condition: description
        nextState: description     
      -
        condition: order_number
        nextState: order_number
      -
        condition: quantity
        nextState: quantity
      -
        condition: item
        nextState: item

  order_number:
    include: true
    transitions:
      -
        condition: order_number
        nextState: order_number
      -
        condition: quantity
        nextState: quantity

  quantity:
    include: true
    transitions:
      - 
        condition: price
        nextState: price

  price:
    include: true
    transitions:
      -
        condition: end
        nextState: end
      - 
        condition: description
        nextState: description
      -
        condition: item
        nextState: item

  end:
    include: false
    transitions:
      -
        condition: any
        nextState: end

conditions:

  item:         '73 < ulx < 110 and text =~ /(\d)*/'
  date:         '110 < ulx < 181 and text =~ /([0-9\-]*)/'
  description:  '193 < ulx < 366'
  order_number: '12 <= uly_rel <= 16 and text =~ /^.+(([0-9]{6})\-)(([0-9]{2}))/'
  quantity:     '393 < ulx < 459'
  price:        '459 < ulx < 523'

  end:          'text =~ /(Footer)/'
  any: "1 = 1"
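
Compared with the configuration in the question, this one adds an `any` transition that loops `INIT` back onto itself, adds a way back to `item` from `description` and `price`, lets `price` continue into `description`, and wraps the `order_number` regex in `/.../`. To see why the extra transitions matter, here is a toy simulation in Python. It is only a sketch and not Textricator's implementation: the names `CONDITIONS`, `make_states`, `run` and `TOKENS` are made up, the token coordinates are invented, the `order_number` state is left out, the `item` regex is tightened to `\d+`, and it assumes that a token with no matching transition stops the parse.

```python
import re

# Toy predicates named after the config's conditions. Each takes a token with
# hypothetical "ulx" (x coordinate) and "text" fields.
CONDITIONS = {
    "item": lambda t: 73 < t["ulx"] < 110 and re.fullmatch(r"\d+", t["text"]) is not None,
    "date": lambda t: 110 < t["ulx"] < 181,
    "description": lambda t: 193 < t["ulx"] < 366,
    "quantity": lambda t: 393 < t["ulx"] < 459,
    "price": lambda t: 459 < t["ulx"] < 523,
    "end": lambda t: "Footer" in t["text"],
    "any": lambda t: True,
}


def make_states(with_fixes):
    """Return state -> ordered list of (condition, next_state) pairs."""
    states = {
        "INIT": [("item", "item")],
        "item": [("date", "date")],
        "date": [("description", "description")],
        "description": [("description", "description"), ("quantity", "quantity")],
        "quantity": [("price", "price")],
        "price": [("end", "end")],
        "end": [("any", "end")],
    }
    if with_fixes:
        # The transitions that the answer's config adds.
        states["INIT"].append(("any", "INIT"))
        states["description"].append(("item", "item"))
        states["price"] += [("description", "description"), ("item", "item")]
    return states


def run(states, tokens):
    """Walk the tokens through the machine; return (records, status)."""
    state, records = "INIT", []
    for tok in tokens:
        for cond, nxt in states[state]:
            if CONDITIONS[cond](tok):  # first matching transition wins
                state = nxt
                break
        else:
            # Assumption of this sketch: an unmatched token stops the parse.
            return records, f"stuck in {state!r} on {tok['text']!r}"
        if state == "item":
            records.append({})  # plays the role of startRecord: true
        if state not in ("INIT", "end"):
            records[-1].setdefault(state, []).append(tok["text"])
    return records, "ok"


# Invented tokens: a header row, two table rows, then the footer.
TOKENS = [
    {"ulx": 80, "text": "Item"}, {"ulx": 120, "text": "Date"},
    {"ulx": 80, "text": "1"}, {"ulx": 120, "text": "2021-05-01"},
    {"ulx": 200, "text": "Widget"}, {"ulx": 400, "text": "2"},
    {"ulx": 470, "text": "9.99"},
    {"ulx": 80, "text": "2"}, {"ulx": 120, "text": "2021-05-02"},
    {"ulx": 200, "text": "Gadget"}, {"ulx": 400, "text": "1"},
    {"ulx": 470, "text": "4.50"},
    {"ulx": 80, "text": "Footer"},
]

if __name__ == "__main__":
    for label, fixed in (("question-style", False), ("answer-style", True)):
        records, status = run(make_states(fixed), TOKENS)
        print(label, status, len(records), "record(s)")
```

In this toy model the question-style machine cannot get past the first header token, because `INIT` has no fallback transition, and even without a header row it would stop at the second row, because `price` can only go to `end`. The answer-style machine skips the header, starts a new record at every `item` token and stops collecting at the footer.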


ocr

text-extraction

itext

pdfbox