Counting frequencies of a list of words in each row of a data frame in Python

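I'd like to ask how to create new columns in an existing data frame from a list of column names. I am counting verb frequencies in each string in a data frame. The verb list looks like this: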

<bound method DataFrame.to_dict of      verb
0   agree
1    bear
2    care
3  choose
4      be>

The code below runs, but the output is the total frequency of all the words rather than one column per word in the word list.

#ver.1 code
import pandas as pd

verb = pd.read_csv('cog_verb.csv')
df2 = pd.DataFrame(df.answer_id)

for x in verb:
    df2[f'count_{x}'] = lemma.str.count('|'.join(r"\b{}\b".format(x)))

The code was updated to reflect Drakax's helpful comment, as shown below:

#updated code
for x in verb:
    df2.to_dict()[f'count_{x}'] = lemma.str.count('|'.join(r"\b{}\b".format(x)))

However, both versions produce the same output:

<bound method DataFrame.to_dict of      answer_id  count_verb
0          312          91
1         1110         123
2         2700         102
3         2764         217
4         2806         182
..         ...         ...
321      33417         336
322      36558         517
323      37316         137
324      37526         119
325      45683        1194

[326 rows x 2 columns]>

-----Updated information----

Following Drakax's suggestion, I have added the first data frame below.

df.to_dict

  <bound method DataFrame.to_dict of      answer_id                                               text
0          312  ANON_NAME_0\n Here are a few instructions for ...
1         1110  October16,2006 \nDear Dad,\n\n I am going to g...
2         2700   My Writing Habits\n I do many things before I...
3         2764  My Ideas about Writing\n I have many ideas bef...
4         2806  I've main habits for writing and I sure each o...
..         ...                                                ...
321      33417  ????????????????????????\n???????????????? ?? ...
322      36558   In this world, there are countless numbers of...
323      37316  My Friend's Room\nWhen I was kid I used to go ...
324      37526   ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ...
325      45683  Primary and Secondary Education in South Korea...

[326 rows x 2 columns]>

Although the output above is correct, I would like the frequency of each word to go in its own column. Thank you for any help you can provide!

OK, it still seems a bit messy, but I think I understand what you want. You can adapt/update your code using mine:

1. This step is only for me; it uses randomly generated strings:

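# Create a new DF
import pandas as pd
from pandas._testing import rands_array  # private helper; may be missing in newer pandas versions

randstr = rands_array(10, 10)  # 10 random strings, 10 characters each
df = pd.DataFrame(data=randstr, columns=["randstr"])
df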
index randstr count
0 20uDmHdBL5 1
1 E62AeycGdy 1
2 tHz99eI8BC 1
3 iZLXfs7R4k 1
4 bURRiuxHvc 2
5 lBDzVuB3z9 1
6 GuIZHOYUr5 1
7 k4wVvqeRkD 1
8 oAIGt8pHbI 1
9 N3BUMfit7a 2

2. Then, to count the occurrences of the desired regex, just do the following:

reg = ['a','e','i','o','u']  # this is where you store your verbs

def count_reg(df):
  for i in reg:
    df[i] = df['randstr'].str.count(i)
  return df

count_reg(df)
index randstr a e i o u
0 h2wcd5yULo 0 0 0 1 0
1 uI400TZnJl 0 0 0 0 1
2 qMiI7morYG 0 0 1 1 0
3 f6Aw6AH3TL 0 0 0 0 0
4 nJ0h9IsDn6 0 0 0 0 0
5 tWyNxnzLwv 0 0 0 0 0
6 V4sTYcPsiB 0 0 1 0 0
7 tSgni67247 0 0 1 0 0
8 sUZn3L08JN 0 0 0 0 0
9 qDiG3Zynk0 0 0 1 0 0
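
3. Applied to the verb list from the question, the same idea could look roughly like the sketch below. Two things in the original loop probably explain the single count_verb column: iterating over a DataFrame (for x in verb) yields its column labels, here just 'verb', and '|'.join(...) applied to a single string puts a | between its characters instead of joining several words. The data here is made up to stand in for cog_verb.csv and the text column, and count_words is a hypothetical helper name, so treat this as a starting point rather than a drop-in fix. If you count on a lemmatized Series such as lemma, use that column instead of text.

```python
import re
import pandas as pd

# Hypothetical stand-ins for the question's data (the real frames come from CSV files).
verb = pd.DataFrame({"verb": ["agree", "bear", "care", "choose", "be"]})
df = pd.DataFrame({
    "answer_id": [1, 2],
    "text": [
        "I agree to be careful and agree again",
        "They choose to bear it; I care",
    ],
})

def count_words(frame, words, text_col="text"):
    """Return a copy of `frame` with one count_<word> column per word."""
    out = frame.copy()
    for w in words:
        # \b keeps 'be' from matching inside 'bear';
        # re.escape guards against regex special characters in a word.
        pattern = rf"\b{re.escape(w)}\b"
        out[f"count_{w}"] = out[text_col].str.count(pattern, flags=re.IGNORECASE)
    return out

# Iterate over the values of the 'verb' column, not over the DataFrame itself.
result = count_words(df, verb["verb"])
print(result.drop(columns="text"))
```

Each verb gets its own count_<verb> column next to answer_id. The word boundaries mean that 'careful' is not counted as 'care', which may or may not be what you want on text that has not been lemmatized.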