Parsing data from structured documents with ambiguous labeling

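I am trying to move legal documents from old SGML files into a database. I have had good luck using regular expressions in Java. However, I have run into a small problem: the labels for each part of a document do not seem to be standard across documents. For example, the most common labeling is: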

(<numeric>)
    (<alpha>)
        (<ROMAN>)
            (<ALPHA>)

e.g. (1)(a)(I)(A)

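However, other documents use variations of this in which () may appear. My current algorithm has a hard-coded regex for each element at each level, but I need a way to set the label type for each level dynamically as I work through the document.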

Has anyone run into a problem like this? Does anyone have any suggestions?

Thanks in advance.

Edit:

Here are the regular expressions I use to parse the different items:

Section: ^<tab>(<b>)?\d{1,4}(\.\d+)?-((\d{1,4}(\.\d+)?)(-|\.)?){3}
SubSection: \.?\s*(<\/b>|<tab>|^)\s*\(\d+(\.\d+)?\)\s+($|<b>|[A-Z"]|\([a-z](.\d+)?\)\s*(\((XC|XL|L?X{0,3})(IX|IV|V?I{0,3})(\.\d+)?\)\s*(\([A-Z](.\d+)?\))?)?\s*.)
Paragraph: (^|<tab>|\s+|\(\d+(\.\d+)?\)\s+)\([a-z](.\d+)?\)(\s+$|\s+<b>|\s+[A-Z"]|\s*\((XC|XL|L?X{0,3})(IX|IV|V?I{0,3})(\.\d+)?\)(\([A-Z](.\d+)?\))?\s*[A-Z"]?)
SubParagraph: (\)|<tab>|<\/b>)\s*\((XC|XL|L?X{0,3})(IX|IV|V?I{0,3})(\.\d+)?\)\s+($|[A-Z"<]|\([A-Z](.\d+)?\)\s*[A-Z"])
SubSubParagraph: (<tab>|\)\s*)\([A-Z](.\d+)?\)\s+([A-Z"]|$)

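Here is some sample text. I misspoke earlier: although the ultimate source of the data is SGML, what I am parsing is slightly different. Apart from the style tags, it is more or less plain text.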

<tab><b>SECTION 5.</b>  In Colorado Revised Statutes, 13-5-142, <b>amend</b> (1)
introductory portion, (1)(b), and (3)(b)(II) as follows:

<tab><b>13-5-142.  National instant criminal background check system - reporting.</b>
(1)  On and after March 20, 2013, the state court administrator shall send electronically
the following information to the Colorado bureau of investigation created pursuant to
section 24-33.5-401, referred to in this section as the "bureau":

<tab>(b)  The name of each person who has been committed by order of the court to the
custody of the office of behavioral health in the department of human services pursuant
to section 27-81-112 or 27-82-108; and

<tab>(3)  The state court administrator shall take all necessary steps to cancel a record
made by the state court administrator in the national instant criminal background check
system if:

<tab>(b)  No less than three years before the date of the written request:

<tab>(II)  The period of commitment of the most recent order of commitment or
recommitment expired, or a court entered an order terminating the person's incapacity or
discharging the person from commitment in the nature of habeas corpus, if the record in
the national instant criminal background check system is based on an order of
commitment to the custody of the office of behavioral health in the department of human
services; except that the state court administrator shall not cancel any record pertaining to
a person with respect to whom two recommitment orders have been entered pursuant to
section 27-81-112 (7) and (8), or who was discharged from treatment pursuant to section
27-81-112 (11) on the grounds that further treatment is not likely to bring about
significant improvement in the person's condition; or

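Your description of the problem is vague, so the only possible answer is a general approach. I have dealt with converting imprecisely formatted documents like this one.

A tool from CS that can help is the state machine. It is appropriate if you can detect (for example with a regex) that the format is changing to a new convention. Detecting such a change switches the state, which in this case is the translator to use for the current and subsequent chunks of text. That translator stays in effect until the next state change. Overall, the algorithm looks like this: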

translator = DEFAULT 
while (chunks of input remain) {
  chunk = GetNextChunkOfInput // a line, paragraph, etc.
  new_translator = ScanChunkForStateChange(chunk, translator)
  if (new_translator != null) translator = new_translator // found a state change!
  print(translator.Translate(chunk))  // use the translator on the chunk
}
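The loop above could be written in Java roughly as follows. This is only a sketch: the two translators (PLAIN and SECTION) and the patterns that trigger a state change are placeholders made up for illustration, not rules derived from the real documents.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.regex.Pattern;

class TranslatorLoop {
    // Hypothetical translators: each turns one chunk of input into output text.
    static final UnaryOperator<String> PLAIN = chunk -> "TEXT: " + chunk;
    static final UnaryOperator<String> SECTION = chunk -> "SECTION: " + chunk;

    // Assumed state-change markers, invented for this sketch.
    static final Pattern SECTION_START = Pattern.compile("^<tab><b>");
    static final Pattern BLANK = Pattern.compile("^\\s*$");

    // Returns the translator to switch to, or null if the chunk signals no change.
    static UnaryOperator<String> scanChunkForStateChange(String chunk) {
        if (SECTION_START.matcher(chunk).find()) return SECTION;
        if (BLANK.matcher(chunk).matches()) return PLAIN;
        return null;
    }

    static List<String> translate(List<String> chunks) {
        List<String> out = new ArrayList<>();
        UnaryOperator<String> translator = PLAIN; // the DEFAULT state
        for (String chunk : chunks) {
            UnaryOperator<String> next = scanChunkForStateChange(chunk);
            if (next != null) translator = next; // found a state change
            out.add(translator.apply(chunk));    // use the current translator
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> chunks = List.of(
            "preamble",
            "<tab><b>SECTION 5.</b>  In Colorado Revised Statutes, 13-5-142, <b>amend</b> (1)",
            "introductory portion, (1)(b), and (3)(b)(II) as follows:",
            "",
            "trailing text");
        translate(chunks).forEach(System.out::println);
    }
}
```

Here a chunk is a single line; the same loop works for paragraphs if the input is split differently.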

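Within this framework, designing the translators and the state-change predicates is a tedious process. All you can do is try, inspect the output, and fix problems, repeating until you can no longer improve the result. At that point you have probably found the maximal structure in the input, so an algorithm that relies only on pattern matching (without trying to model the semantics, for example with AI) will not get you much further.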
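For the nested labels in the question, one possible state to track is the list of label styles seen so far, one per nesting level. The sketch below is an assumption about how that might look, not a tested solution for the real corpus: the Style patterns are simplified versions of the asker's regexes, and the class and method names are invented. A label that fits a style already on the stack returns to that level; otherwise it opens a new, deeper level. Ambiguous labels such as "I" (Roman numeral or letter) are resolved by preferring the deepest known level and then the Roman style, which will be wrong for some documents.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class LabelLevels {
    // Simplified label styles; order matters when a label opens a new level.
    enum Style {
        NUMERIC("\\d+(\\.\\d+)?"),
        LOWER_ALPHA("[a-z](\\.\\d+)?"),
        UPPER_ROMAN("[IVXL]+(\\.\\d+)?"),
        UPPER_ALPHA("[A-Z](\\.\\d+)?");

        private final Pattern pattern;
        Style(String regex) { pattern = Pattern.compile(regex); }
        boolean matches(String label) { return pattern.matcher(label).matches(); }
    }

    // Styles learned so far, outermost level first. This is the machine's state.
    private final List<Style> levels = new ArrayList<>();

    /** Returns the 0-based nesting level for a label such as "1", "b" or "II". */
    public int levelOf(String label) {
        // Prefer the deepest known level whose style fits the label.
        for (int i = levels.size() - 1; i >= 0; i--) {
            if (levels.get(i).matches(label)) {
                levels.subList(i + 1, levels.size()).clear(); // close deeper levels
                return i;
            }
        }
        // Otherwise the label opens a new level with the first style that fits.
        for (Style style : Style.values()) {
            if (style.matches(label)) {
                levels.add(style);
                return levels.size() - 1;
            }
        }
        throw new IllegalArgumentException("Unrecognized label: " + label);
    }

    public static void main(String[] args) {
        LabelLevels tracker = new LabelLevels();
        // Labels in the order they open paragraphs in the sample text.
        for (String label : new String[] {"1", "b", "3", "b", "II"}) {
            System.out.println("(" + label + ") -> level " + tracker.levelOf(label));
        }
    }
}
```

For the sample text this assigns levels 0, 1, 0, 1, 2. Because deeper levels are forgotten whenever the text returns to an outer level, each section can use a different style for its inner levels without any hard-coded order.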

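The text fragment you posted can be parsed and structured by an SGML parser using custom grammar rules in a DOCTYPE, also known as a DTD (assuming that <tab> in your example stands for an actual tab start-element tag rather than a TAB character). I took your snippet, stored it in a file named data.ent, and then created the following SGML file, doc.sgm, which references it: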

<!DOCTYPE doc [
    <!ELEMENT doc O O (tab)+>
    <!ELEMENT tab - O (((b,c?)|c),text)>
    <!ELEMENT text O O (#PCDATA|b)+>
    <!ELEMENT b - - (#PCDATA)>
    <!ELEMENT c - - (#PCDATA)>
    <!ENTITY data SYSTEM "data.ent">
    <!ENTITY startc "<c>">
    <!ENTITY endc "</c>">
    <!SHORTREF intab "(" startc ")" endc>
    <!USEMAP intab tab>
    <!USEMAP #EMPTY text>
]>
&data

The result of parsing your data with these DTD rules (by running osgmlnorm doc.sgm on the command line) is as follows:

<DOC>
  <TAB>
    <B>SECTION 5.</B>
    <TEXT>In Colorado Revised Statutes, 13-5-142, <B>amend</B> (1)
      introductory portion, (1)(b), and (3)(b)(II) as follows:
    </TEXT>
  </TAB>
  <TAB>
    <B>13-5-142.  National instant criminal background check system
      reporting.</B>
    <C>1</C>
    <TEXT>On and after March 20, 2013, the state court administrator
      shall send electronically the following information to the
      Colorado bureau of investigation created pursuant to section
      24-33.5-401, referred to in this section as the "bureau":
    </TEXT>
  </TAB>
  <TAB>
    <C>b</C>
    <TEXT>The name of each person who has been committed by order
      of the court to the custody of the office of behavioral health
      in the department of human services pursuant to section 27-81-112
      or 27-82-108; and
    </TEXT>
  </TAB>
  <TAB>
    <C>3</C>
    <TEXT>The state court administrator shall take all necessary steps
      to cancel a record made by the state court administrator in the
      national instant criminal background check system if:
    </TEXT>
  </TAB>
  <TAB>
    <C>b</C>
    <TEXT>No less than three years before the date of the written
      request:
    </TEXT>
  </TAB>
  <TAB>
    <C>II</C>
    <TEXT>The period of commitment of the most recent order of
      commitment or recommitment expired, or a court entered an order
      terminating the person's incapacity or discharging the person
      from commitment in the nature of habeas corpus, if the record in 
      the national instant criminal background check system is based on
      an order of commitment to the custody of the office of behavioral
      health in the department of human services; except that the state
      court administrator shall not cancel any record pertaining to
      a person with respect to whom two recommitment orders have been
      entered pursuant to section 27-81-112 (7) and (8), or who was
      discharged from treatment pursuant to section 27-81-112 (11) on
      the grounds that further treatment is not likely to bring about
      significant improvement in the person's condition; or
    </TEXT>
  </TAB>
</DOC>

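Explanation:

  • The SGML DTD I created uses SGML tag inference to infer the made-up DOC element as the document element, as well as the artificial TEXT element; the main purpose is to impose a document structure as a sequence of TAB elements, each containing a section identifier (such as <b>SECTION 5.</b> and/or (c)) followed by the section body text
  • I also made up an ad-hoc element C to wrap section-identifier text placed in parentheses (the ( and ) characters); the start- and end-element tags for C are inserted automatically by the SGML processor because of the DTD's SHORTREF map rules; these tell SGML that, within a TAB element, it should replace every ( character with the value of the startc entity (which expands to <C>) and every ) character with the value of the endc entity (which expands to </C>)
  • <!USEMAP #EMPTY text> turns off the expansion of parentheses within the TEXT body part of a TAB section, so references such as (7) and (8) in the body text are left unchanged (although these could also be turned into HTML-like links using SGML)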

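If you are using <tab> to mean the tab (ASCII 9) character, SGML can handle that as well, for example by using a SHORTREF rule similar to the ones shown to convert TAB characters into <TAB> tags.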

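Note that you need the osgmlnorm program installed; on Ubuntu you can install it with sudo apt-get install opensp, and the procedure is similar on other Linux variants and on Mac OS. For your application, you will probably want to use the osx program (also part of OpenSP) to output the normalized parse result as XML (although the output shown above can already be parsed as XML), and then use the Java XML APIs to process the structured content as needed.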
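As a rough illustration of that last step, the sketch below reads XML shaped like the output above with the standard Java DOM API and pulls out the label (C) and body (TEXT) of each TAB element. The class and method names are invented for this example, and the XML string in main is a shortened stand-in for the real parser output.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TabReader {
    /** One row per TAB element: { label from C (or ""), body text from TEXT }. */
    static List<String[]> readTabs(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        List<String[]> rows = new ArrayList<>();
        NodeList tabs = doc.getElementsByTagName("TAB");
        for (int i = 0; i < tabs.getLength(); i++) {
            Element tab = (Element) tabs.item(i);
            rows.add(new String[] { firstText(tab, "C"), firstText(tab, "TEXT") });
        }
        return rows;
    }

    // Text of the first descendant element with the given name, whitespace collapsed.
    static String firstText(Element parent, String name) {
        NodeList found = parent.getElementsByTagName(name);
        if (found.getLength() == 0) return "";
        return found.item(0).getTextContent().trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String xml = "<DOC>"
            + "<TAB><B>SECTION 5.</B><TEXT>In Colorado Revised Statutes, 13-5-142, "
            + "<B>amend</B> (1)</TEXT></TAB>"
            + "<TAB><C>b</C><TEXT>No less than three years before the date of the written\n"
            + "      request:</TEXT></TAB>"
            + "</DOC>";
        for (String[] row : readTabs(xml)) {
            System.out.println("[" + row[0] + "] " + row[1]);
        }
    }
}
```

Each row could then be written to the database; handling of the B section headings is left out to keep the sketch short.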