Find unique sentences in a document
I have a text corpus with roughly 300,000 sentences. I only want to keep the unique sentences, meaning that if a sentence occurs twice, I want to keep just one copy.
This is what I have tried in Python 3:
def unique_sentences(data):
    # Split on '.', deduplicate with a set, then join back together
    u_sent = list(set([w for w in data.split('.')]))
    return ".".join(u_sent)
The problem is that it also removes the unique sentences. Do you know of any other way to do this in Python?
I suggest using a well-known library such as NLTK to split the text data. When I ran your code on your example text, I got the following result:
Input: 'This is an example. It is another one. This is the third one. This is an example. This is an example.'
Output: .This is an example. This is the third one. This is an example. It is another one
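For what it's worth, the duplicate survives because data.split('.') leaves a leading space on every fragment after the first, so 'This is an example' and ' This is an example' are different strings to the set. A minimal sketch (not part of the original answer) that strips each fragment shows the difference:

data = 'This is an example. It is another one. This is the third one. This is an example. This is an example.'

# Naive split: fragments keep their leading space, so the first
# 'This is an example' and the later ' This is an example' both survive the set.
print(set(data.split('.')))

# Stripping each fragment (and dropping the empty one after the final '.')
# collapses the duplicates.
print({s.strip() for s in data.split('.') if s.strip()})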
But when I split the sentences with the NLTK library using the following code, I got the desired result:
import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')  # download the Punkt model used by sent_tokenize
unique_sentences = set(sent_tokenize(data))
Output: {'It is another one.', 'This is the third one.', 'This is an example.'}
Also, if you care about the order of the sentences, you can get the unique sentences with the following:
from collections import OrderedDict

# OrderedDict.fromkeys keeps only the first occurrence of each sentence, in order
unique_ordered = list(OrderedDict.fromkeys(sent_tokenize(data)))
output = ' '.join(unique_ordered)
Output: This is an example. It is another one. This is the third one.
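As a side note, on Python 3.7 and later a plain dict already preserves insertion order, so OrderedDict is not strictly required. A rough end-to-end sketch for a larger corpus (assuming, hypothetically, that the text is stored in a file named corpus.txt) could look like this:

import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')

with open('corpus.txt', encoding='utf-8') as f:  # hypothetical file name
    data = f.read()

# dict.fromkeys keeps the first occurrence of each sentence and preserves order (Python 3.7+)
unique_ordered = list(dict.fromkeys(sent_tokenize(data)))
output = ' '.join(unique_ordered)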