Find unique sentences in a document

I have a text corpus containing about 300,000 sentences. I want to keep only the unique sentences: if a sentence occurs twice, I want just one copy of it.

Here is what I tried in Python 3:
def unique_sentences(data):
    u_sent = list(set([w for w in data.split('.')]))
    return ".".join(u_sent)

The problem is that it also removes unique sentences. Do you know of any other way to do this in Python?

I suggest splitting the text with a well-known library such as NLTK. When I ran your code on some example text, I got the following result:

Input: 'This is an example. It is another one. This is the third one. This is an example. This is an example.'

Output: .This is an example. This is the third one. This is an example. It is another one
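
The stray dot and the surviving duplicate in that output come from how str.split('.') behaves: each fragment keeps its leading space, so 'This is an example' and ' This is an example' are distinct members of the set, and the final period produces an empty string. A quick sketch with your input shows this:

data = 'This is an example. It is another one. This is the third one. This is an example. This is an example.'

fragments = data.split('.')
print(fragments)
# ['This is an example', ' It is another one', ' This is the third one',
#  ' This is an example', ' This is an example', '']

# The set keeps both the spaced and unspaced variants, plus the empty
# string that produces the stray '.' when re-joined:
print(set(fragments))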

But when I split the sentences with the NLTK library instead, using the following code, I got the desired result:

import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')  # fetch the Punkt sentence-tokenizer models (needed once)
unique_sentences = set(sent_tokenize(data))  # data is the input string shown above

Output: {'It is another one.', 'This is the third one.', 'This is an example.'}
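
As a side note, if you also want to see how often each sentence repeats before deduplicating, collections.Counter works on the same tokenized list; a small sketch, using the example input from above:

from collections import Counter
from nltk.tokenize import sent_tokenize

data = 'This is an example. It is another one. This is the third one. This is an example. This is an example.'

counts = Counter(sent_tokenize(data))  # sentence -> number of occurrences
print(counts.most_common())
# [('This is an example.', 3), ('It is another one.', 1), ('This is the third one.', 1)]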

Also, if you care about the order of the sentences, you can get the unique sentences while preserving that order like this:

from collections import OrderedDict

# OrderedDict.fromkeys deduplicates while keeping first-seen order
unique_ordered = list(OrderedDict.fromkeys(sent_tokenize(data)))
output = ' '.join(unique_ordered)

Output: This is an example. It is another one. This is the third one.
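
On Python 3.7+ a plain dict also preserves insertion order, so dict.fromkeys does the same deduplication without the extra import, and the hashed membership test keeps it fast even on a corpus of ~300,000 sentences. A minimal sketch:

from nltk.tokenize import sent_tokenize

data = 'This is an example. It is another one. This is the third one. This is an example. This is an example.'

# dict.fromkeys keeps the first occurrence of each sentence, in order
unique_ordered = list(dict.fromkeys(sent_tokenize(data)))
print(' '.join(unique_ordered))
# This is an example. It is another one. This is the third one.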