Python - Calculate word frequency in percentage
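I have an article, and I have counted the number of words and the frequency of each word. Now I need to show the top 7 as percentages, and I don't know how to do that. I know how to calculate a percentage (part/whole), but I'm not sure how to write the code. I have already sorted by value below.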
from operator import itemgetter

# TEXT is the path to the article file, defined elsewhere in the script

def word_frequency():
    """
    Function for word frequency
    """
    d = dict()
    with open(TEXT, "r") as f:
        for line in f:
            words = line.split()
            for w in words:
                if w in d:
                    d[w] += 1
                else:
                    d[w] = 1
    dict_list = sorted(d.items(), key = itemgetter(1), reverse = True)
    print(dict_list[0:7])
This gives me the following list:
[('the', 12), ('to', 8), ('of', 6), ('and', 5), ('a', 4), ('in', 4), ('Phil', 3)]
But how do I calculate and present them as percentages instead of counts?
The text contains 199 words.
Regards
Edit: newly revised code
def word_frequency():
    """
    Function for word frequency
    """
    d = dict()
    with open(TEXT, "r") as f:
        for line in f:
            words = line.split()
            for w in words:
                # round(1/1.99, 1) is 0.5, so each occurrence adds 0.5 percentage
                # points; this hard-codes the 199-word total of this text
                if w in d:
                    d[w] += round(1/1.99, 1)
                else:
                    d[w] = round(1/1.99, 1)
    dict_list = sorted(d.items(), key = itemgetter(1), reverse = True)
    print(dict_list[0:7])
This gives me the following list:
[('the', 6.0), ('to', 4.0), ('of', 3.0), ('and', 2.5), ('a', 2.0), ('in', 2.0), ('Phil', 1.5)]
I now have the percentages, but is there a way to present them more nicely?
Like:
the 6%
to 4%
of 3%
and 2.5%
a 2%
in 2%
Phil 1.5%
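You can loop over the (word, percentage) pairs in dict_list. It is already a list of tuples, so no .items() call is needed:
for k, v in dict_list:
    percent = str(v) + ' %'
    result = k + ' ' + percent
    print(result)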
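The loop above prints values such as "6.0 %". To get the exact layout asked for in the question (6% rather than 6.0 %), one option is the `g` format specifier, which drops trailing zeros. This is a minimal sketch, assuming the input is the list of (word, percentage) pairs from the question; `format_percentages` is a hypothetical helper name, not part of the original code.

```python
def format_percentages(pairs):
    """Return one 'word N%' line per (word, percentage) pair."""
    # ':g' drops trailing zeros, so 6.0 becomes '6' and 2.5 stays '2.5'
    return ["{} {:g}%".format(word, pct) for word, pct in pairs]


# Sample data copied from the output shown in the question
dict_list = [('the', 6.0), ('to', 4.0), ('of', 3.0), ('and', 2.5),
             ('a', 2.0), ('in', 2.0), ('Phil', 1.5)]

for line in format_percentages(dict_list):
    print(line)
```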
Alternatively, you can use Counter from collections to count the word frequencies.
from operator import itemgetter
from collections import Counter

def most_common(instances):
    """Returns a list of (instance, count) sorted in total order and then from most to least common"""
    return sorted(sorted(Counter(instances).items(), key=itemgetter(0)), key=itemgetter(1), reverse=True)
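Using that most_common function, you can do what you described: "calculate percentage, part/whole". Iterate over the words and their counts, and divide each count by the total number of words.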
# words = list of strings
frequencies = most_common(words)
percentages = [(instance, count / len(words)) for instance, count in frequencies]
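Counter also has a built-in most_common(n) method, which may be enough if the order among words with equal counts does not matter (ties come back in first-seen order rather than alphabetically). The following is only a sketch: `top_percentages` is a hypothetical name, and the percentage is computed against the actual word total rather than a hard-coded 199.

```python
from collections import Counter


def top_percentages(words, n=7):
    """Return the n most common words as (word, percentage) pairs."""
    counts = Counter(words)
    total = sum(counts.values())  # the "whole" in part/whole
    if total == 0:
        return []
    return [(word, 100 * count / total) for word, count in counts.most_common(n)]


words = "the cat and the dog and the bird".split()
for word, pct in top_percentages(words, n=2):
    print("{} {:.2f}%".format(word, pct))
```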
Depending on your use case, re.findall(r"\w+", text) might not be the best way to extract the words.
To get the top 7 words, slice percentages with percentages[:7].
import re
text = "Alice opened the door and found that it led into a small passage, not much larger than a rat-hole: she knelt down and looked along the passage into the loveliest garden you ever saw."
words = re.findall(r"\w+", text)
frequencies = most_common(words)
percentages = [(instance, count / len(words)) for instance, count in frequencies]
for word, percentage in percentages[:7]:
    print("%s %.2f%%" % (word, percentage * 100))
Output:
the 8.57%
a 5.71%
and 5.71%
into 5.71%
passage 5.71%
Alice 2.86%
along 2.86%
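Note that \w+ splits "rat-hole" into "rat" and "hole", which is why the example text counts as 35 words. If hyphenated words and contractions should count as single words, a slightly different pattern can be used. This is a sketch only: `WORD_RE` and `extract_words` are hypothetical names, and the pattern is one assumption about what counts as a word, not part of the original answer.

```python
import re

# A word is a run of letters/digits, optionally joined by single
# hyphens or apostrophes (so "rat-hole" and "didn't" stay whole).
WORD_RE = re.compile(r"[^\W_]+(?:[-'][^\W_]+)*")


def extract_words(text):
    """Return the words found in text, keeping hyphenated words intact."""
    return WORD_RE.findall(text)


print(extract_words("a rat-hole: she didn't look"))
```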
If the same word in different cases should count as one word, you can normalize all the words before calling most_common.
import unicodedata

def normalize_caseless(text):
    return unicodedata.normalize("NFKD", text.casefold())
Then:
words = ...
becomes:
words = list(map(normalize_caseless, ...))
A string that contains the same word in different cases, such as:
text = "Hello Test test TEST test TeSt"
then results in:
test 83.33%
hello 16.67%
instead of:
test 33.33%
Hello 16.67%
TEST 16.67%
TeSt 16.67%
Test 16.67%
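The pieces above can be combined into a single function. This is a sketch under a few assumptions: `word_percentages` is a hypothetical name, words are extracted with \w+ as in the example above, and ties are broken alphabetically, which gives the same order as the two-pass sort in most_common.

```python
import re
import unicodedata
from collections import Counter


def normalize_caseless(text):
    return unicodedata.normalize("NFKD", text.casefold())


def word_percentages(text, top=7):
    """Return 'word xx.xx%' lines for the most frequent words, ignoring case."""
    words = [normalize_caseless(w) for w in re.findall(r"\w+", text)]
    if not words:
        return []
    # Sort by descending count, then alphabetically to break ties
    ranked = sorted(Counter(words).items(), key=lambda item: (-item[1], item[0]))
    return ["%s %.2f%%" % (word, 100 * count / len(words))
            for word, count in ranked[:top]]


for line in word_percentages("Hello Test test TEST test TeSt"):
    print(line)
```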