Extracting Data Logically from Unstructured Text

Question

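What is the best way to parse unstructured text like the following using NLP? Would a model be the best solution? If so, what is the best way to get started?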

Prerequisite: Mathematics 408C and 408D with a grade of a C, and Physics 301 or 301K with a grade of a C.

This should produce a result like:

Mathematics 408C C, Mathematics 408D C, Physics 301 C or Physics 301K C

I have tried using regular expressions alone, but the sentence structure can be much more complex and inconsistent. For example:

Prerequisite: Architecture 415K with a grade of at least C; Mathematics 408C or 408K; and Physics 302K and 102M, or 303K and 103M.

Desired result:

Architecture 415K C, Mathematics 408C or Mathematics 408K, Physics 302K and Physics 102M or Physics 303K and Physics 103M

Answer 1

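Your input text is not really natural language (full sentences that carry information) but semi-structured text, which makes it awkward to handle with either rule-based or semantics-based approaches.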

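A semantic/neural-model-based approach, such as using a pre-trained question-answering model from huggingface (python/PyTorch), may be somewhat out of place here, but it can help recover some structure in a way that is almost independent of how the input was originally structured.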
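As a rough sketch of what that could look like, the snippet below asks one question per course using the `question-answering` pipeline from the `transformers` library. The helper names (`build_grade_question`, `ask_grades`) and the question wording are my own assumptions for illustration, and the pipeline is created with its default pre-trained model, which may not be the best choice for this kind of text.

```python
def build_grade_question(course):
    """Phrase a grade question for one course (hypothetical wording)."""
    return f"What grade is required for {course}?"


def ask_grades(context, courses):
    """Ask a pre-trained QA model for the grade of each course.

    Returns a dict mapping course -> (answer text, model confidence score).
    """
    # Imported here so the question helper works without transformers installed.
    from transformers import pipeline

    # Uses the library's default question-answering model (an assumption;
    # a different checkpoint could be passed via the `model` argument).
    qa = pipeline("question-answering")

    results = {}
    for course in courses:
        out = qa(question=build_grade_question(course), context=context)
        # The pipeline returns a dict with "answer", "score", "start", "end".
        results[course] = (out["answer"], out["score"])
    return results
```

Calling `ask_grades(text, ["Mathematics 408C", "Physics 301"])` on one of the prerequisite strings would return one answer span and score per course. Answers with a low score are probably best treated as "no grade stated".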

The benefit of this approach is that it is largely independent of the input structure: for example, the input could be written as full sentences or as a bullet point list.

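Since the model only returns valid output if the question can be clearly answered from the context, you will have to use a rule-based approach to get the list of courses mentioned in the prerequisites, and then use that first answer to create a valid question about the grade.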
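For the rule-based part, a minimal sketch might look like the following. It assumes that a subject is a single capitalized word directly followed by a course number, and that a course number is three digits with an optional trailing capital letter; both are guesses based only on the two examples in the question. The function name `extract_courses` is hypothetical.

```python
import re

# Either a subject name that is directly followed by a course number,
# or a bare course number such as 408C, 301 or 102M (assumed formats).
TOKEN = re.compile(r"\b([A-Z][a-z]+)(?=\s+\d{3}[A-Z]?\b)|\b(\d{3}[A-Z]?)\b")


def extract_courses(text):
    """List the courses in text, carrying the last seen subject forward."""
    courses = []
    subject = None
    for match in TOKEN.finditer(text):
        if match.group(1):
            # A new subject applies to every number until the next subject.
            subject = match.group(1)
        elif subject is not None:
            courses.append(f"{subject} {match.group(2)}")
    return courses


example = (
    "Prerequisite: Mathematics 408C and 408D with a grade of a C, "
    "and Physics 301 or 301K with a grade of a C."
)
print(extract_courses(example))
```

For the first example in the question this yields Mathematics 408C, Mathematics 408D, Physics 301 and Physics 301K. It does not capture the and/or structure or the grades, and it would miss multi-word subject names, so it only covers the "list of courses" step.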

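But honestly, I wonder how many possible ways there are to write such a list, and, if this is only a small set of texts, how much effort you want to put into it. If you show more examples, we can discuss possible rule-based approaches.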
