Creating a parallel corpus from list of words and list of sentences (Python)

Question

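I am trying to create a parallel corpus for supervised machine learning.

Basically, I want two files: one with a complete sentence on each line, and another containing only the specific, manually extracted terms that correspond to the sentence on the same line.

I have already created the file with one sentence per line; now I want to generate the label file with the terms for each line. To illustrate, this is the code I came up with: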

import re

list_of_terms = ["cake", "cola", "water", "stop"]
sentences = ["Let's eat some cake.", "I'd like to have some cola to go with the cake.", "stop eating all this cake, you waterstopper", "I will never eat this again", "cake and cola and water"]
para = []
for line in sentences:
    s = re.findall(r"(?=\b("+'|'.join(list_of_terms)+r")\b)", line)
    para.append(s)
print(*para, sep = "\n")

This produces the output I want:

['cake']
['cola', 'cake']
['stop', 'cake']
[]
['cake', 'cola', 'water']

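Unfortunately, the code does not work as well for the corpora I am actually working with. In fact, I have run into three different anomalies.

For one corpus, re.findall consistently returns a tuple with an extra '' for every term: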

[('criminal', ''), ('liability', ''), ('legal', ''), ('fiscal', ''), ('criminal', ''), ('law', '')]

I worked around this thanks to the last comment in this thread: Use of findall and parenthesis in Python

[x if x!='' else y for x,y in re.findall(r"(?=\b("+'|'.join(list_of_terms)+r")\b)", line)]

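However, this approach raises a ValueError on the other two corpora I am using, because there the regex does not produce the ''. For those I simply wrapped it in a try/except block and fell back to the example code, with satisfactory results. But why does the regex not produce the '' in those cases?
Finally, yet another corpus raises an re.error, "re.error: nothing to repeat at position 4950", and I have not found a way around it. I suspect there are special characters in list_of_terms; is there a way to filter these out beforehand?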

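Needless to say, I am still very new to coding, as my background is in translation rather than computer science. So an elegant answer would be much appreciated! :)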

P.S.: The corpora I am using are all in the ACTER corpus collection: https://github.com/AylaRT/ACTER

Answer 1

You need to re.escape every item in list_of_terms and use unambiguous word boundaries:

re.findall(r"(?=(?<!\w)("+'|'.join(map(re.escape, list_of_terms))+r")(?!\w))", line)

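(?<!\w) is a negative lookbehind that matches a position not immediately preceded by a word char (a digit, letter or _).

(?!\w) is a negative lookahead that matches a position not immediately followed by a word char.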
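
To see how unescaped terms could account for the anomalies described in the question, here is a minimal sketch. The terms "tax(es)" and "+/- 5%" and the helper name find_terms are made up for illustration and are not taken from the ACTER corpora. An unescaped term containing parentheses adds a second capturing group, so re.findall returns tuples; a term starting with a quantifier such as + makes the pattern invalid. A term list without such characters has only one group, which would explain why the other corpora return plain strings.

```python
import re

def find_terms(terms, line):
    """Return every listed term found in line, in order of appearance."""
    # re.escape makes characters such as ( ) + * ? match literally.
    alternatives = "|".join(map(re.escape, terms))
    pattern = r"(?=(?<!\w)(" + alternatives + r")(?!\w))"
    return re.findall(pattern, line)

# Hypothetical terms containing regex metacharacters (illustration only).
terms = ["cake", "tax(es)", "+/- 5%"]

# Unescaped, "tax(es)" adds a second capturing group, so findall yields tuples.
unescaped = r"(?=\b(" + "|".join(["cake", "tax(es)"]) + r")\b)"
tuples = re.findall(unescaped, "taxes and cake")
print(tuples)  # [('taxes', 'es'), ('cake', '')]

# Unescaped, a term starting with "+" is an invalid pattern ("nothing to repeat").
try:
    re.compile(r"(?=\b(" + "|".join(["cake", "+/- 5%"]) + r")\b)")
    compile_failed = False
except re.error as exc:
    compile_failed = True
    print("re.error:", exc)

# Escaped, with lookaround boundaries, every term is matched literally.
found = find_terms(terms, "tax(es) rose by +/- 5% after the cake sale")
print(found)  # ['tax(es)', '+/- 5%', 'cake']
```

The lookarounds matter here as well: \b requires a word character on one side of the position, so a term that ends in ) or % and is followed by a space would never satisfy a trailing \b, whereas (?!\w) only checks that no word character follows.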
