Python: displaying frequent words in a table and skipping certain words
I am currently doing a frequency analysis of a text file, displaying the 100 most frequently used words in it. At the moment I am using this code:
from collections import Counter
import re
words = re.findall(r'\w+', open('tweets.txt').read().lower())
print(Counter(words).most_common(100))
The code above works, and the output is:
[('the', 1998), ('t', 1829), ('https', 1620), ('co', 1604), ('to', 1247), ('and', 1053), ('in', 957), ('a', 899), ('of', 821), ('i', 789), ('is', 784), ('you', 753), ('will', 654), ('for', 601), ('on', 574), ('thank', 470), ('be', 455), ('great', 447), ('hillary', 440), ('we', 390), ('that', 373), ('s', 363), ('it', 346), ('with', 345), ('at', 333), ('me', 327), ('are', 311), ('amp', 290), ('clinton', 288), ('trump', 287), ('have', 286), ('our', 264), ('realdonaldtrump', 256), ('my', 244), ('all', 237), ('crooked', 236), ('so', 233), ('by', 226), ('this', 222), ('was', 217), ('people', 216), ('has', 210), ('not', 210), ('just', 210), ('america', 204), ('she', 190), ('they', 188), ('trump2016', 180), ('very', 180), ('make', 180), ('from', 175), ('rt', 170), ('out', 169), ('he', 168), ('her', 164), ('makeamericagreatagain', 164), ('join', 161), ('as', 158), ('new', 157), ('who', 155), ('again', 154), ('about', 145), ('no', 142), ('get', 138), ('more', 137), ('now', 136), ('today', 136), ('president', 135), ('can', 134), ('time', 123), ('media', 123), ('vote', 117), ('but', 117), ('am', 116), ('bad', 116), ('going', 115), ('maga', 112), ('u', 112), ('many', 110), ('if', 110), ('country', 108), ('big', 108), ('what', 107), ('your', 105), ('cnn', 105), ('never', 104), ('one', 101), ('up', 101), ('back', 99), ('jobs', 98), ('tonight', 97), ('do', 97), ('been', 97), ('would', 94), ('obama', 93), ('tomorrow', 88), ('said', 88), ('like', 88), ('should', 87), ('when', 86)]
However, I would like to display it as a table with the headers "Word" and "Count". I tried using the prettytable package and came up with this:
from collections import Counter
import re
import prettytable
words = re.findall(r'\w+', open('tweets.txt').read().lower())
for label, data in ('Word', words):
    pt = prettytable(field_names=[label, 'Count'])
    c = Counter(data)
    [pt.add_row(kv) for kv in c.most_common() [:100] ]
    pt.align [label], pt.align['Count'] = '1', 'r'
    print pt
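It gives me ValueError: too many values to unpack. My question is: what is wrong with my code, and is there a way to display the data using prettytable? Also, how should I modify my code?
Bonus question: is there a way to leave out certain words when counting frequencies? For example, skipping words such as and, if, of, etc.
Thanks.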
Is this what you are trying to do?
from prettytable import PrettyTable
x = PrettyTable(["Words", "Counts"])
L = [('the', 1998), ('t', 1829), ('https', 1620), ('co', 1604), ('to', 1247), ('and', 1053), ('in', 957), ('a', 899), ('of', 821), ('i', 789), ('is', 784), ('you', 753), ('will', 654), ('for', 601), ('on', 574), ('thank', 470), ('be', 455), ('great', 447), ('hillary', 440), ('we', 390), ('that', 373), ('s', 363), ('it', 346), ('with', 345), ('at', 333), ('me', 327), ('are', 311), ('amp', 290), ('clinton', 288), ('trump', 287), ('have', 286), ('our', 264), ('realdonaldtrump', 256), ('my', 244), ('all', 237), ('crooked', 236), ('so', 233), ('by', 226), ('this', 222), ('was', 217), ('people', 216), ('has', 210), ('not', 210), ('just', 210), ('america', 204), ('she', 190), ('they', 188), ('trump2016', 180), ('very', 180), ('make', 180), ('from', 175), ('rt', 170), ('out', 169), ('he', 168), ('her', 164), ('makeamericagreatagain', 164), ('join', 161), ('as', 158), ('new', 157), ('who', 155), ('again', 154), ('about', 145), ('no', 142), ('get', 138), ('more', 137), ('now', 136), ('today', 136), ('president', 135), ('can', 134), ('time', 123), ('media', 123), ('vote', 117), ('but', 117), ('am', 116), ('bad', 116), ('going', 115), ('maga', 112), ('u', 112), ('many', 110), ('if', 110), ('country', 108), ('big', 108), ('what', 107), ('your', 105), ('cnn', 105), ('never', 104), ('one', 101), ('up', 101), ('back', 99), ('jobs', 98), ('tonight', 97), ('do', 97), ('been', 97), ('would', 94), ('obama', 93), ('tomorrow', 88), ('said', 88), ('like', 88), ('should', 87), ('when', 86)]
for e in L:
    x.add_row([e[0],e[1]])
print(x)
The result looks like this:
+-----------------------+--------+
| Words | Counts |
+-----------------------+--------+
| the | 1998 |
| t | 1829 |
| https | 1620 |
| co | 1604 |
| to | 1247 |
| and | 1053 |
| in | 957 |
| a | 899 |
| of | 821 |
| i | 789 |
| is | 784 |
| you | 753 |
| will | 654 |
| for | 601 |
| on | 574 |
| thank | 470 |
| be | 455 |
| great | 447 |
| hillary | 440 |
| we | 390 |
| that | 373 |
| s | 363 |
| it | 346 |
| with | 345 |
| at | 333 |
| me | 327 |
| are | 311 |
| amp | 290 |
| clinton | 288 |
| trump | 287 |
| have | 286 |
| our | 264 |
| realdonaldtrump | 256 |
| my | 244 |
| all | 237 |
| crooked | 236 |
| so | 233 |
| by | 226 |
| this | 222 |
| was | 217 |
| people | 216 |
| has | 210 |
| not | 210 |
| just | 210 |
| america | 204 |
| she | 190 |
| they | 188 |
| trump2016 | 180 |
| very | 180 |
| make | 180 |
| from | 175 |
| rt | 170 |
| out | 169 |
| he | 168 |
| her | 164 |
| makeamericagreatagain | 164 |
| join | 161 |
| as | 158 |
| new | 157 |
| who | 155 |
| again | 154 |
| about | 145 |
| no | 142 |
| get | 138 |
| more | 137 |
| now | 136 |
| today | 136 |
| president | 135 |
| can | 134 |
| time | 123 |
| media | 123 |
| vote | 117 |
| but | 117 |
| am | 116 |
| bad | 116 |
| going | 115 |
| maga | 112 |
| u | 112 |
| many | 110 |
| if | 110 |
| country | 108 |
| big | 108 |
| what | 107 |
| your | 105 |
| cnn | 105 |
| never | 104 |
| one | 101 |
| up | 101 |
| back | 99 |
| jobs | 98 |
| tonight | 97 |
| do | 97 |
| been | 97 |
| would | 94 |
| obama | 93 |
| tomorrow | 88 |
| said | 88 |
| like | 88 |
| should | 87 |
| when | 86 |
+-----------------------+--------+
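Edit 1: If you want to leave out certain words, you can do this:
for e in L:
    # A membership test is needed here: chaining != checks with "or" is always true
    if e[0] not in ("and", "if", "of"):
        x.add_row([e[0],e[1]])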
Edit 2: Putting it all together:
from collections import Counter
import re
from prettytable import PrettyTable
words = re.findall(r'\w+', open('tweets.txt').read().lower())
counts = Counter(words).most_common(100)
x = PrettyTable(["Words", "Counts"])
skip_list = ['and','if','or'] # see joe's comment
for e in counts:
    if e[0] not in skip_list:
        x.add_row([e[0],e[1]])
print(x)
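One thing to watch with the summary above: the skip list is applied after most_common(100), so the table can end up with fewer than 100 rows. A possible variation is sketched below. It drops the skipped words first and only then takes the top n. The helper name top_words and the sample sentence are made up for illustration and are not part of the original answer.

```python
from collections import Counter
import re

def top_words(text, skip, n):
    """Return the n most common words in text, ignoring any word in skip."""
    words = re.findall(r'\w+', text.lower())
    counts = Counter(w for w in words if w not in skip)
    return counts.most_common(n)

# Made-up sample text; the real input would be the contents of tweets.txt.
sample = "the cat and the dog and the bird saw a cat"
skip_list = {'and', 'if', 'or'}

# Filtering after taking the top 3 can leave fewer than 3 rows.
after = [pair for pair in Counter(re.findall(r'\w+', sample)).most_common(3)
         if pair[0] not in skip_list]

# Filtering before taking the top 3 still yields 3 rows.
before = top_words(sample, skip_list, 3)

print(len(after), len(before))
```

Whether the shorter table matters depends on what you need. If exactly 100 rows are wanted, filtering before the cut seems the safer choice.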
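I'm not sure how you expected the for loop you wrote to work. The error you are getting is because you are trying to iterate over the two-element tuple ('Word', words). In its first iteration, the statement for label, data in ('Word', words) tries to assign 'W' to label and 'o' to data, and ends up with 'r' and 'd' left over. Perhaps you meant to zip the items together? But then why would you make a new table for each word?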
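The unpacking failure can be reproduced in isolation. The sketch below uses a made-up word list (not the contents of tweets.txt) and a hypothetical helper try_original_loop to show that the loop tries to unpack the string 'Word' itself. It then shows one way a single (label, data) pair could be iterated, by wrapping the tuple in a list.

```python
# Hypothetical sample data standing in for the words read from tweets.txt.
words = ['the', 'and', 'the']

def try_original_loop():
    """Run the loop header from the question and return the error text, if any."""
    try:
        # The first item of the tuple is 'Word', a 4-character string,
        # which cannot be unpacked into just two names.
        for label, data in ('Word', words):
            pass
    except ValueError as exc:
        return str(exc)
    return None

error_message = try_original_loop()
print(error_message)

# Wrapping the pair in a list makes the loop see one (label, data) pair.
pairs = [(label, data) for label, data in [('Word', words)]]
print(pairs)
```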
Here is a rewritten version:
from collections import Counter
import re, prettytable
words = re.findall(r'\w+', open('tweets.txt').read().lower())
c = Counter(words)
pt = prettytable.PrettyTable(['Words', 'Counts'])
pt.align['Words'] = 'l'
pt.align['Counts'] = 'r'
for row in c.most_common(100):
    pt.add_row(row)
print(pt)
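To skip elements in the most-common counts, you can simply drop them from the counter before calling most_common. An easy way to do this is to define a list of unwanted words and then filter them out with a dict comprehension: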
bad_words = ['the', 'if', 'of']
c = Counter({k: v for k, v in c.items() if k not in bad_words})
Alternatively, you can filter the list of words before counting them:
words = filter(lambda x: x not in bad_words, words)
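I prefer to operate on the counter, since the data is already aggregated there and so less work is required. Here is the combined code for reference: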
from collections import Counter
import re, prettytable
bad_words = ['the', 'if', 'of']
words = re.findall(r'\w+', open('tweets.txt').read().lower())
c = Counter(words)
c = Counter({k: v for k, v in c.items() if k not in bad_words})
pt = prettytable.PrettyTable(['Words', 'Counts'])
pt.align['Words'] = 'l'
pt.align['Counts'] = 'r'
for row in c.most_common(100):
    pt.add_row(row)
print(pt)
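As a quick sanity check, the two filtering approaches described above should give the same counts. The sketch below compares them on a made-up word list rather than tweets.txt. It also assumes bad_words is a set instead of a list, which is only a suggestion to make membership tests cheaper.

```python
from collections import Counter

# Made-up sample words; in the answer these come from tweets.txt.
words = ['the', 'great', 'of', 'the', 'vote', 'if', 'great', 'the']
bad_words = {'the', 'if', 'of'}  # a set, assumed here for fast lookups

# Approach 1: count everything, then drop unwanted keys from the counter.
filtered_counter = Counter({k: v for k, v in Counter(words).items()
                            if k not in bad_words})

# Approach 2: drop unwanted words first, then count what is left.
filtered_words = Counter(w for w in words if w not in bad_words)

print(filtered_counter == filtered_words)
print(filtered_counter.most_common(2))
```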