Distance between occurrences of a word
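I have a text file containing some sentences. Suppose there are three sentences: "Rahul backed from the market.", "We are going to market", "All the shops are closed in the market."
Now I need to calculate the distances between occurrences of the word "market".
Here the result should be 5 and 8, because the second occurrence of "market" comes 5 words after the first, and the third comes 8 words after the second.
I am using the nltk word tokenizer to get the words. In practice I need to do this for most of the words in the corpus.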
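If your list of words is in document order, you can enumerate it and build a lookup in which each key is a word and each value is the list of indices where that word was found: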
import re
from collections import defaultdict
s = "Rahul backed from the market. We are going to market All the shops are closed in the market."
# using re for simplicity
words = re.findall(r'\w+', s)
# map each word to the list of positions where it occurs
positions = defaultdict(list)
for index, word in enumerate(words):
    positions[word].append(index)
positions
positions looks like:
defaultdict(list,
{'Rahul': [0],
'backed': [1],
'from': [2],
'the': [3, 11, 16],
'market': [4, 9, 17],
'We': [5],
'are': [6, 13],
'going': [7],
'to': [8],
'All': [10],
'shops': [12],
'closed': [14],
'in': [15]})
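With that, you can compute the distances by zipping each index list with itself shifted by one and subtracting the paired indices: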
distances = {}
for word, l in positions.items():
    # pair each index n with the next index m and take the difference
    distances[word] = [m - n for n, m in zip(l, l[1:])]
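Now distances is a dictionary mapping each word to the distances between its consecutive occurrences. Words that occur only once map to an empty list, since a distance has no meaning there: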
{'Rahul': [],
'backed': [],
'from': [],
'the': [8, 5],
'market': [5, 8],
'We': [],
'are': [7],
'going': [],
'to': [],
'All': [],
'shops': [],
'closed': [],
'in': []}
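Since the question mentions doing this for most words in a corpus, the two steps above can be wrapped into one function that accepts any ordered token list, for example the output of a tokenizer. This is only a sketch: the name occurrence_gaps and the lowercase option are assumptions added for illustration, not part of the answer above.

```python
from collections import defaultdict


def occurrence_gaps(tokens, lowercase=False):
    """Map each token to the gaps between its consecutive occurrences.

    `lowercase` is a hypothetical convenience flag: when True, "The" and
    "the" are counted as the same word.
    """
    positions = defaultdict(list)
    for index, token in enumerate(tokens):
        key = token.lower() if lowercase else token
        positions[key].append(index)
    # Pair each position with the next one and subtract.
    return {
        word: [later - earlier for earlier, later in zip(idxs, idxs[1:])]
        for word, idxs in positions.items()
    }


tokens = (
    "Rahul backed from the market We are going to market "
    "All the shops are closed in the market"
).split()
gaps = occurrence_gaps(tokens)
print(gaps["market"])  # [5, 8]
```

Note that the indices depend on how the text is tokenized: a tokenizer that emits punctuation as separate tokens will shift the positions, so the distances may differ from those shown for the regex-based split.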