Get word counts in strings of words in a list using Python

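I'm starting with a pandas dataframe whose first column holds comment strings and whose other columns are features for individual words. For each row, I want to count how many times each word occurs in that row's comment cell. I have the list of words (the feature columns) in a list named "wordList", and I'm trying something like this, but I can't get it to work or get the counts back into the dataframe: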

def word_count(comments):
    for word in wordList:
        return comment.count(word)

df.comments.apply(word_count)

What I have:

comments        |  hello  |  this  |   is   |  the  |  comments  |  blah  |
--------------------------------------------------------------------------
this is the 1st |         |        |        |       |            |    
comments here   |         |        |        |       |            |
--------------------------------------------------------------------------
the 2nd comment |         |        |        |       |            |    
is is here this |         |        |        |       |            |

What I want:

comments        |  hello  |  this  |   is   |  the  |  comments  |  blah  |
--------------------------------------------------------------------------
this is the 1st |    0    |    1   |   2    |   1   |     1      |    0
comments is here|         |        |        |       |            |
--------------------------------------------------------------------------
the 2nd comment |    0    |    1   |   2    |   2   |     0      |    0
is is here the  |         |        |        |       |            |
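The attempt above has two problems: the `return` sits inside the loop, so the function exits after the first word, and the body refers to `comment` while the parameter is named `comments`. A minimal sketch of an `apply`-based fix follows. It assumes whole-word matching on whitespace-separated tokens (unlike `str.count`, which would also count the "is" inside "this"), and the sample dataframe is made up from the second table.

```python
from collections import Counter

import pandas as pd

wordList = ["hello", "this", "is", "the", "comments", "blah"]

# Hypothetical sample data modelled on the "what I want" table.
df = pd.DataFrame({"comments": ["this is the 1st comments is here",
                                "the 2nd comment is is here the"]})

def word_count(comment):
    # Tally every whitespace-separated token once, then look up each
    # word of interest (a Counter returns 0 for missing keys).
    tally = Counter(comment.split())
    return pd.Series({word: tally[word] for word in wordList})

# Returning a Series from apply yields one column per word.
counts = df["comments"].apply(word_count)
print(counts)
```

Because "comments" is both the text column and one of the words, `df.join(counts)` would raise on the overlapping name; `pd.concat([df, counts], axis=1)` or renaming the text column avoids that.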

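Split each entry of your comments column into a list of words and explode it.

Apply get_dummies. This gives one indicator row per word occurrence.

Reindex with the list of words you want to check.

Aggregate the frequencies and join them to the df1.coments column.

The code is as follows: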

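import pandas as pd

words = ['hello', 'this', 'is', 'the', 'comments', 'blah']

# One row per token (original index repeated), one-hot encoded, restricted to the words of interest
g = pd.get_dummies(df1.coments.str.split().explode()).reindex(columns=words, fill_value=0).astype(int)

# Sum the indicators per original row and attach the counts to the comments column
df1[['coments']].join(g.groupby(level=0).sum())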




     coments                         hello  this  is  the  comments  blah
0    this is the 1st comments here      0     1   1    1         1     0
1  the 2nd comment is is here this      0     1   2    1         0     0
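The four steps can also be run one at a time to see what each contributes. This is a sketch only: the dataframe is rebuilt from the output shown above, and the intermediate names (`tokens`, `dummies`, `counts`) are made up for illustration.

```python
import pandas as pd

words = ["hello", "this", "is", "the", "comments", "blah"]

# Sample data copied from the output table; the column is spelled
# "coments" so it does not clash with the word column "comments".
df1 = pd.DataFrame({"coments": ["this is the 1st comments here",
                                "the 2nd comment is is here this"]})

# Step 1: split on whitespace and explode to one token per row.
# The original row label is repeated for every token of that row.
tokens = df1["coments"].str.split().explode()

# Step 2: one indicator column per distinct token.
dummies = pd.get_dummies(tokens)

# Step 3: keep only the words of interest; words that never occur
# (such as "hello") are added as all-zero columns.
dummies = dummies.reindex(columns=words, fill_value=0).astype(int)

# Step 4: add up the indicators per original row and join the text back on.
counts = dummies.groupby(level=0).sum()
result = df1[["coments"]].join(counts)
print(result)
```

Splitting on whitespace means only exact tokens are counted, so "comment" in the second row does not count towards "comments".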

You can use str.extractall to extract (only) the words in the word list, then use value_counts:

pattern = '|'.join(word_list)
(df.comments.str.extractall(rf'\b({pattern})\b')[0]
   .groupby(level=0).value_counts()
   .unstack(fill_value=0)
   .reindex(word_list, axis=1, fill_value=0)
)

Output (note that this also has a column named comments, just like the original dataframe):

0  hello  this  is  the  comments  blah
0      0     1   1    1         1     0
1      0     1   2    1         0     0
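Two details may matter in practice: a word containing regex metacharacters would change the pattern, and a row with no matches at all drops out of the `extractall` result. A hedged sketch that handles both, using `re.escape` and a reindex on the original row labels, is below. The third sample row is invented to show the no-match case.

```python
import re

import pandas as pd

word_list = ["hello", "this", "is", "the", "comments", "blah"]

# Sample data; the last row is hypothetical and matches none of the words.
df = pd.DataFrame({"comments": ["this is the 1st comments here",
                                "the 2nd comment is is here this",
                                "nothing relevant"]})

# Escape each word so characters such as "." or "+" are matched literally.
pattern = "|".join(map(re.escape, word_list))

counts = (
    df["comments"].str.extractall(rf"\b({pattern})\b")[0]
    .groupby(level=0).value_counts()   # count each matched word per row
    .unstack(fill_value=0)             # words become columns
    # Restore rows without any match and words that never occur.
    .reindex(index=df.index, columns=word_list, fill_value=0)
)

# concat tolerates the duplicate "comments" name, whereas join would raise.
combined = pd.concat([df, counts], axis=1)
print(combined)
```

If duplicate column names are inconvenient downstream, renaming the text column before combining is the simpler option.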