Grammar parser for parsing parliamentary debates?

I want to parse plain text produced by a transcription tool (the goal is to render it as LegalDocML).

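My problem is that I don't know where to start: grammar parsers have a fairly steep learning curve. I'm looking for guidance on which kind of parser suits this problem.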

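My gut feeling is that the text below is a candidate for an LR grammar tool, because there are some clear delimiters (speaker in all caps, speaker role in parentheses, time of speech in square brackets). But there are also some NLP requirements: for grievances, the person being addressed usually appears only loosely, somewhere in the first sentence of the speech.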

Any advice would be much appreciated.

As an example:

Legislative Assembly
Thursday, 19 May 2022
               
THE SPEAKER (Mrs M.H. Roberts) took the chair at 9.00 am, acknowledged country and read prayers.
PAPER TABLED
A paper was tabled and ordered to lie upon the table of the house.
SMALL BUSINESS ASSISTANCE GRANTS
Statement by Minister for Small Business
Statement
MR D.T. PUNCH (Bunbury — Minister for Small Business) [9.01 am]: I would like to bring to the attention of the house some recent changes made by the McGowan government to the small business assistance grants. As I have previously advised the house, in February the state government announced a  million level 1 COVID-19 business assistance package, and more recently a  million package for businesses impacted by level 2 public health and social measures, taking the total committed to COVID-19 business support to almost .7 billion over the past two years. The level 1 package includes  million in rent relief assistance and the level 2 package includes a .8 million small business hardship grants program.
Last month, a revision and expansion of the small business hardship grants program was announced.
.
.
.
HOME INDEMNITY INSURANCE
Grievance
MR R.S. LOVE (Moore — Deputy Leader of the Opposition) [9.06 am]: I grieve today to the Parliamentary Secretary to the Minister for Commerce on behalf of Western Australian residents who have had their

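This problem really does sit in the awkward wasteland between context-free parsing and natural language parsing: the former is too precise to handle unstructured discourse, and the latter (as far as I understand the current state of the art) is not designed to take advantage of subtle typographic cues.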

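My suggestion, for what it's worth, is to use an ad hoc collection of regular expressions to try to capture the typographic style and the boilerplate phrases ("A paper was tabled and ordered to lie upon the table of the house."). That's what I did when I tried to do something like this with the Canadian equivalent a few decades ago (back when Perl was state of the art), and it mostly worked, although it needed a certain amount of manual intervention. (My style is to use sanity checks to try to detect mishandled cases and log them for future improvement.) How much work all of this takes will depend on how precise you need the results to be.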
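To make that concrete, here is a minimal sketch of the regex approach in Python (not the Perl I used back then). Everything in it is an assumption drawn from your one sample: the honorific list, the `(electorate — office) [time]:` layout of a speech opener, the rule that an all-caps line is a heading that closes the current speech, and the "I grieve today to the ... on behalf of" wording. The names `parse_transcript`, `SPEECH_START`, `HEADING`, `GRIEVANCE_TARGET` and `BOILERPLATE` are made up for illustration. Treat it as a starting point to adjust against real transcripts, not a finished parser.

```python
import re

# Speech opener, e.g. "MR D.T. PUNCH (Bunbury <dash> Minister ...) [9.01 am]: text"
# Assumption: the honorific list and the time format come from the sample only.
SPEECH_START = re.compile(
    r"^(?P<speaker>(?:MR|MRS|MS|DR|HON) [A-Z.' -]+?) "
    r"\((?P<role>[^)]*)\) "
    r"\[(?P<time>\d{1,2}\.\d{2} [ap]m)\]: "
    r"(?P<text>.*)$"
)
# Assumption: a line made up only of capitals, digits and light punctuation
# is a subject heading.
HEADING = re.compile(r"^[A-Z][A-Z0-9 ,.'-]*$")
# Assumption: grievances open with this phrasing; real ones will vary.
GRIEVANCE_TARGET = re.compile(
    r"^I grieve today to the (?P<target>.+?) on behalf of"
)
# Boilerplate sentences that are recognised and skipped.
BOILERPLATE = {
    "A paper was tabled and ordered to lie upon the table of the house.",
}


def parse_transcript(lines):
    """Return (speeches, unhandled); unhandled holds (line_no, text) pairs."""
    speeches, unhandled = [], []
    heading = None   # most recent all-caps heading
    current = None   # speech currently being collected
    for line_no, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line or line in BOILERPLATE:
            continue
        start = SPEECH_START.match(line)
        if start:
            # The role is "electorate <em dash> office" in the sample;
            # office is None when there is no dash.
            electorate, _, office = start["role"].partition(" \u2014 ")
            target = GRIEVANCE_TARGET.match(start["text"])
            current = {
                "speaker": start["speaker"],
                "electorate": electorate,
                "office": office or None,
                "time": start["time"],
                "heading": heading,
                "addressed_to": target["target"] if target else None,
                "text": [start["text"]],
            }
            speeches.append(current)
        elif HEADING.match(line):
            heading, current = line, None  # a heading closes the open speech
        elif current is not None:
            current["text"].append(line)   # continuation paragraph
        else:
            # Sanity check: nothing claimed this line, so record it for review.
            unhandled.append((line_no, line))
    return speeches, unhandled
```

Calling `parse_transcript(text.splitlines())` on your sample should pick up the two speeches and leave lines such as "Statement" and "Grievance" in `unhandled`. That list is the review pile: each entry is either a new pattern worth adding or a case to fix by hand.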

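If you have access to enough computing resources, there's a good chance you could build a machine learning model that performs reasonably well. But you would still need to do a lot of validation and recalibration, unless you can tolerate errors.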