Calculate the frequency of all combinations of 2 words in Python
I have a piece of text. I want to count every possible combination of 2 words (the 2 words must be adjacent).
For example:
"I have 2 laptops, I have 2 chargers"
The result should be:
"I have": 2
"have 2": 2
"2 laptops": 1
"laptops, I": (don't count)
"2 chargers": 1
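I tried a regular expression, but the problem is that it never counts the same word in two different matches.
I used: \b[a-z]{1,20}\b \b[a-z]{1,20}\b
Text: cold chain, energy storage device, industrial cooling system
It almost works, but it misses pairs such as "storage device" and "cooling system", because "storage" and "cooling" have already been consumed by the matches "energy storage" and "industrial cooling".
Thanks for your suggestions.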
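The missing pairs come from the way regex matching consumes text: once "energy storage" has matched, scanning resumes after "storage". One possible workaround, sketched here as an illustration rather than as part of the original question, is to wrap the pair in a lookahead so that each match is zero-width and the next search can start at the following word. The name `overlapping_pairs` is hypothetical, and the pattern keeps the original `[a-z]{1,20}` restriction, so digits and capital letters are still ignored.

```python
import re
from collections import Counter

# The lookahead (?=...) captures a pair without consuming it, so the second
# word of one pair can also be the first word of the next pair.
PAIR = re.compile(r"\b(?=([a-z]{1,20} [a-z]{1,20})\b)")

def overlapping_pairs(text):
    """Return every pair of adjacent lowercase words, overlaps included."""
    return PAIR.findall(text)

text = "cold chain, energy storage device, industrial cooling system"
pairs = overlapping_pairs(text)
counts = Counter(pairs)  # frequency of each pair
print(counts)
```

Because the pattern requires exactly one space between the two words, a pair that straddles a comma (such as "chain, energy") is never produced.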
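You can use zip to get each group of two adjacent words, then use Counter to get the frequencies. The condition at the end skips any pair whose first word ends with a comma.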
>>> from collections import Counter
>>> text = "I have 2 laptops, I have 2 chargers"
>>> words = text.split()
>>> d = {' '.join(pair): n for pair, n in Counter(zip(words, words[1:])).items() if not pair[0].endswith(',')}
>>> print(d)
{'I have': 2, 'have 2': 2, '2 laptops,': 1, '2 chargers': 1}
>>> import json
>>> print(json.dumps(d, indent=4))
{
    "I have": 2,
    "have 2": 2,
    "2 laptops,": 1,
    "2 chargers": 1
}
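The output above still carries the trailing comma in "2 laptops,", whereas the question asks for "2 laptops". A minimal sketch of one way to get there, assuming punctuation should act as a hard boundary between pairs: split the text on punctuation first, then pair words within each segment. The function name `count_bigrams` and the choice of punctuation characters are assumptions for illustration, not part of the answer above.

```python
import re
from collections import Counter

def count_bigrams(text):
    """Count adjacent word pairs, never pairing words across punctuation."""
    counts = Counter()
    # Assumed boundary characters; extend the class as the text requires.
    for segment in re.split(r"[,.;:!?]+", text):
        words = segment.split()
        # zip(words, words[1:]) yields each word together with its successor.
        counts.update(" ".join(pair) for pair in zip(words, words[1:]))
    return dict(counts)

result = count_bigrams("I have 2 laptops, I have 2 chargers")
print(result)
```

Splitting before pairing removes the need for the comma check in the comprehension, since "laptops" and "I" end up in different segments.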