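Counting frequencies of a list of words in each row in a data frame in python
I would like to ask how to create new columns in an existing data frame from a list of column names. I am counting the frequency of verbs in each string of a data frame. The verb list looks like this: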
<bound method DataFrame.to_dict of verb
0 agree
1 bear
2 care
3 choose
4 be>
The code below runs, but the output is the total frequency of all the words rather than a separate column for each word in the list.
#ver.1 code
import pandas as pd
verb = pd.read_csv('cog_verb.csv')
df2 = pd.DataFrame(df.answer_id)
for x in verb:
    df2[f'count_{x}'] = lemma.str.count('|'.join(r"\b{}\b".format(x)))
I updated the code to reflect Drakax's helpful comment, as follows:
#updated code
for x in verb:
    df2.to_dict()[f'count_{x}'] = lemma.str.count('|'.join(r"\b{}\b".format(x)))
But both versions produced the same output:
<bound method DataFrame.to_dict of answer_id count_verb
0 312 91
1 1110 123
2 2700 102
3 2764 217
4 2806 182
.. ... ...
321 33417 336
322 36558 517
323 37316 137
324 37526 119
325 45683 1194
[326 rows x 2 columns]>
----- Update -----
Following Drakax's suggestion, I have added the first data frame below.
df.to_dict
<bound method DataFrame.to_dict of answer_id text
0 312 ANON_NAME_0\n Here are a few instructions for ...
1 1110 October16,2006 \nDear Dad,\n\n I am going to g...
2 2700 My Writing Habits\n I do many things before I...
3 2764 My Ideas about Writing\n I have many ideas bef...
4 2806 I've main habits for writing and I sure each o...
.. ... ...
321 33417 ????????????????????????\n???????????????? ?? ...
322 36558 In this world, there are countless numbers of...
323 37316 My Friend's Room\nWhen I was kid I used to go ...
324 37526 ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ...
325 45683 Primary and Secondary Education in South Korea...
[326 rows x 2 columns]>
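Although the output above is correct, I would like a separate column holding the frequency of each word.
Any help would be appreciated. Thank you very much!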
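OK, it still seems a bit of a mess, but I think I understand what you want; you can adapt/update your code using mine:
1. This step is only for me; create a new DF from randomly generated strings: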
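import random, string
import pandas as pd
# pandas' rands_array is a private testing helper, so plain Python is used here instead
chars = string.ascii_letters + string.digits
randstr = ["".join(random.choices(chars, k=10)) for _ in range(10)]  # 10 strings of length 10
df = pd.DataFrame(data=randstr, columns=["randstr"])
df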
index | randstr | count |
---|---|---|
0 | 20uDmHdBL5 | 1 |
1 | E62AeycGdy | 1 |
2 | tHz99eI8BC | 1 |
3 | iZLXfs7R4k | 1 |
4 | bURRiuxHvc | 2 |
5 | lBDzVuB3z9 | 1 |
6 | GuIZHOYUr5 | 1 |
7 | k4wVvqeRkD | 1 |
8 | oAIGt8pHbI | 1 |
9 | N3BUMfit7a | 2 |
2. Then, to count the occurrences of the patterns you want, just do the following:
reg = ['a', 'e', 'i', 'o', 'u']  # this is where you store your verbs
def count_reg(df):
    # add one column per pattern, holding its per-row match count
    for i in reg:
        df[i] = df['randstr'].str.count(i)
    return df
count_reg(df)
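index | randstr | a | e | i | o | u |
---|---|---|---|---|---|---|
0 | h2wcd5yULo | 0 | 0 | 0 | 1 | 0 |
1 | uI400TZnJl | 0 | 0 | 0 | 0 | 1 |
2 | qMiI7morYG | 0 | 0 | 1 | 1 | 0 |
3 | f6Aw6AH3TL | 0 | 0 | 0 | 0 | 0 |
4 | nJ0h9IsDn6 | 0 | 0 | 0 | 0 | 0 |
5 | tWyNxnzLwv | 0 | 0 | 0 | 0 | 0 |
6 | V4sTYcPsiB | 0 | 0 | 1 | 0 | 0 |
7 | tSgni67247 | 0 | 0 | 1 | 0 | 0 |
8 | sUZn3L08JN | 0 | 0 | 0 | 0 | 0 |
9 | qDiG3Zynk0 | 0 | 0 | 1 | 0 | 0 |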
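To carry the same idea over to the verb list from the question, something like the sketch below might work. It is only an illustration: the two sample texts and the helper name `count_words` are made up here, and `verb` is rebuilt by hand instead of being read from cog_verb.csv. Two things seem to explain the single `count_verb` column in the question: iterating over a DataFrame yields its column labels rather than its rows, and calling `'|'.join(...)` on one string puts `|` between its characters, so the pattern ends up matching single characters instead of whole words.

```python
import re
import pandas as pd

# Hypothetical stand-ins for the question's data: `verb` mimics cog_verb.csv,
# `df` mimics the answer_id/text frame (the real texts are not shown in full).
verb = pd.DataFrame({"verb": ["agree", "bear", "care", "choose", "be"]})
df = pd.DataFrame({
    "answer_id": [312, 1110],
    "text": [
        "I agree. We agree to be careful.",
        "Choose to care and be kind; bear with me.",
    ],
})

# Iterating over a DataFrame yields column labels, not row values.
print(list(verb))  # ['verb']

def count_words(frame, text_col, words):
    """Return a copy of `frame` with one count_<word> column per word."""
    out = frame.copy()
    for w in words:
        # \b keeps "be" from matching inside "bear"; re.escape guards
        # against regex metacharacters appearing in a word.
        pattern = rf"\b{re.escape(w)}\b"
        out[f"count_{w}"] = out[text_col].str.count(pattern, flags=re.IGNORECASE)
    return out

# Loop over the values of the "verb" column, not over the DataFrame itself.
result = count_words(df, "text", verb["verb"])
print(result.drop(columns="text"))
```

Note that this counts exact word forms only, so inflected forms such as "agrees" are not matched; running it on lemmatized text (the `lemma` series in the question) would presumably take care of that.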