Find unique sentences in a document
I have a text corpus with roughly 300,000 sentences. I only want to keep the unique sentences, meaning that if a sentence occurs twice, I want to keep just one copy.
This is what I have tried in Python 3:
def unique_sentences(data):
    # Split on '.', deduplicate with a set, then join back together
    u_sent = list(set([w for w in data.split('.')]))
    return ".".join(u_sent)
The problem is that it also removes the unique sentences. Do you know of any other way to do this in Python?
I suggest using a well-known library such as NLTK to split the text data. When I ran your code on your example text, I got the following result:
Input: 'This is an example. It is another one. This is the third one. This is an example. This is an example.'
Output: .This is an example. This is the third one. This is an example. It is another one
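For what it's worth, the duplicate survives because data.split('.') leaves a leading space on every fragment after the first, so 'This is an example' and ' This is an example' are different strings to the set. A minimal sketch (not part of the original answer) that strips each fragment shows the difference:

data = 'This is an example. It is another one. This is the third one. This is an example. This is an example.'

# Naive split: fragments keep their leading space, so the first
# 'This is an example' and the later ' This is an example' both survive the set.
print(set(data.split('.')))

# Stripping each fragment (and dropping the empty one after the final '.')
# collapses the duplicates.
print({s.strip() for s in data.split('.') if s.strip()})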
But when I split the sentences with the NLTK library using the following code, I got the desired result:
import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')  # download the Punkt model used by sent_tokenize
unique_sentences = set(sent_tokenize(data))
Output: {'It is another one.', 'This is the third one.', 'This is an example.'}
Also, if you care about the order of the sentences, you can get the unique sentences with the following:
from collections import OrderedDict

# OrderedDict.fromkeys keeps only the first occurrence of each sentence, in order
unique_ordered = list(OrderedDict.fromkeys(sent_tokenize(data)))
output = ' '.join(unique_ordered)
Output: This is an example. It is another one. This is the third one.
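As a side note, on Python 3.7 and later a plain dict already preserves insertion order, so OrderedDict is not strictly required. A rough end-to-end sketch for a larger corpus (assuming, hypothetically, that the text is stored in a file named corpus.txt) could look like this:

import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')

with open('corpus.txt', encoding='utf-8') as f:  # hypothetical file name
    data = f.read()

# dict.fromkeys keeps the first occurrence of each sentence and preserves order (Python 3.7+)
unique_ordered = list(dict.fromkeys(sent_tokenize(data)))
output = ' '.join(unique_ordered)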