Get word counts in strings of words in a list using python

Question

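I am starting with a pandas dataframe in which the first column consists of comment strings and the other columns are features for individual words. For each row, I want to count how many times each word occurs in that row's comments cell. I have the list of words (the feature columns) in a list called "wordList". I am trying something like the following, but I can't get it to work and get the counts back into the dataframe: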

def word_count(comments):
    for word in wordList:
        return comment.count(word)

df.comments.apply(word_count)

What I have:

comments        |  hello  |  this  |   is   |  the  |  comments  |  blah  |
--------------------------------------------------------------------------
this is the 1st |         |        |        |       |            |    
comments here   |         |        |        |       |            |
--------------------------------------------------------------------------
the 2nd comment |         |        |        |       |            |    
is is here this |         |        |        |       |            |

What I want:

comments        |  hello  |  this  |   is   |  the  |  comments  |  blah  |
--------------------------------------------------------------------------
this is the 1st |    0    |    1   |   2    |   1   |     1      |    0
comments is here|         |        |        |       |            |
--------------------------------------------------------------------------
the 2nd comment |    0    |    1   |   2    |   2   |     0      |    0
is is here the  |         |        |        |       |            |

Answer 1

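Split your comments column into lists of words and explode it.

Apply get_dummies. This gives an indicator for each occurrence of each word.

Reindex using the list of words you want to check.

Aggregate the frequencies and join them to the df.coments column.

The code is below: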

# Split on whitespace, explode to one token per row, one-hot encode, keep only the wanted words.
g = pd.get_dummies(df1.coments.str.split(r'\s').explode()).reindex(columns=['hello', 'this', 'is', 'the', 'comments', 'blah']).fillna(0).astype(int)

# Sum the indicators per original row and join them to the comments column.
pd.DataFrame(df1.iloc[:, 0]).join(g.groupby(level=0).sum())

     coments                         hello  this  is  the  comments  blah
0    this is the 1st comments here      0     1   1    1         1     0
1  the 2nd comment is is here this      0     1   2    1         0     0
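
The four steps listed in this answer can also be written one at a time, which makes the intermediate results easier to inspect. This is only a sketch under assumptions: the sample frame is rebuilt by hand from the question's rows, the column name `coments` follows this answer's code, and the variable names (`tokens`, `dummies`, `counts`, `result`) are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data rebuilt from the question; `coments` matches the answer's column name.
df1 = pd.DataFrame({"coments": ["this is the 1st comments here",
                                "the 2nd comment is is here this"]})
word_list = ["hello", "this", "is", "the", "comments", "blah"]

# Step 1: split each comment on whitespace and explode to one token per row.
# The original row label is repeated for every token that came from that row.
tokens = df1["coments"].str.split().explode()

# Step 2: one-hot encode the tokens (one indicator column per distinct token).
dummies = pd.get_dummies(tokens)

# Step 3: keep only the words of interest; words that never occur become all-zero columns.
dummies = dummies.reindex(columns=word_list, fill_value=0).astype(int)

# Step 4: sum the indicators per original row and attach them to the comments column.
counts = dummies.groupby(level=0).sum()
result = df1[["coments"]].join(counts)
print(result)
```

Because the comments are split into whole tokens before counting, "is" is not counted inside "this", which is what a plain substring count with `str.count` would do.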

Answer 2

You can use str.extractall to extract (only) the words in the word list, then use value_counts:

pattern = '|'.join(word_list)
(df.comments.str.extractall(rf'\b({pattern})\b')[0]
   .groupby(level=0).value_counts()
   .unstack(fill_value=0)
   .reindex(word_list, axis=1, fill_value=0)
)

Output (note that this also has a column named comments, just as in the original dataframe):

0  hello  this  is  the  comments  blah
0      0     1   1    1         1     0
1      0     1   2    1         0     0
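
Since the question asks for the counts back in the dataframe, here is a minimal sketch of one way to attach them. It assumes a small sample frame rebuilt from the question's rows. The `_count` suffix and the use of `re.escape` are additions that are not part of this answer: the suffix avoids the clash between the `comments` text column and the `comments` count column, and escaping treats any regex metacharacters in the word list literally.

```python
import re
import pandas as pd

# Hypothetical sample data rebuilt from the question.
df = pd.DataFrame({"comments": ["this is the 1st comments here",
                                "the 2nd comment is is here this"]})
word_list = ["hello", "this", "is", "the", "comments", "blah"]

# Escape each word so that characters such as "." or "+" are matched literally.
pattern = "|".join(re.escape(w) for w in word_list)

counts = (
    df["comments"].str.extractall(rf"\b({pattern})\b")[0]
    .groupby(level=0).value_counts()   # count each matched word per original row
    .unstack(fill_value=0)             # words become columns
    .reindex(index=df.index, columns=word_list, fill_value=0)
)

# Suffix the count columns so they cannot collide with the text column, then join on the index.
result = df.join(counts.add_suffix("_count"))
print(result)
```

Passing `index=df.index` to `reindex` keeps rows that contain none of the listed words as all-zero rows instead of dropping them.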

python

count

word

pandas