Get word counts in strings of words in a list using Python
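I'm starting from a pandas DataFrame whose first column holds comment strings and whose other columns are features for individual words. For each row, I want to count how many times each word appears in that row's comments cell. I have the list of words (the feature columns) in a list called "wordList", and I'm trying something like this, but I can't get it to work or get the counts back into the DataFrame: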
def word_count(comments):
    for word in wordList:
        return comment.count(word)

df.comments.apply(word_count)
What I have:
comments | hello | this | is | the | comments | blah |
--------------------------------------------------------------------------
this is the 1st | | | | | |
comments here | | | | | |
--------------------------------------------------------------------------
the 2nd comment | | | | | |
is is here this | | | | | |
What I want:
comments | hello | this | is | the | comments | blah |
--------------------------------------------------------------------------
this is the 1st | 0 | 1 | 2 | 1 | 1 | 0
comments is here| | | | | |
--------------------------------------------------------------------------
the 2nd comment | 0 | 1 | 2 | 2 | 0 | 0
is is here the | | | | | |
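1. Split each entry of the comments column into a list of words and explode it, so every word gets its own row.
2. Apply get_dummies. This gives one indicator column per word, which can later be summed into frequencies.
3. Reindex the columns with the list of words you want to check.
4. Aggregate the frequencies per original row and join the result back onto the comments column (named coments in this answer's DataFrame).

Code below: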
import pandas as pd
words = ['hello', 'this', 'is', 'the', 'comments', 'blah']
g = pd.get_dummies(df1.coments.str.split(r'\s+').explode()).reindex(columns=words).fillna(0).astype(int)
pd.DataFrame(df1.iloc[:, 0]).join(g.groupby(level=0).sum())
                           coments  hello  this  is  the  comments  blah
0    this is the 1st comments here      0     1   1    1         1     0
1  the 2nd comment is is here this      0     1   2    1         0     0
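To see what each of the four steps contributes, the one-liner can be unrolled into named intermediate results. This is only a sketch: the sample DataFrame is rebuilt from the output shown above, and the variable names tokens, dummies, wanted, and counts are illustrative choices, not part of the answer.

```python
import pandas as pd

# Sample data rebuilt from the answer's output; the misspelled column name
# "coments" is kept so the snippet matches the code above.
df1 = pd.DataFrame({
    "coments": [
        "this is the 1st comments here",
        "the 2nd comment is is here this",
    ]
})
words = ["hello", "this", "is", "the", "comments", "blah"]

# Step 1: one row per token; the original row label is repeated in the index.
tokens = df1["coments"].str.split().explode()

# Step 2: one indicator column per distinct token.
dummies = pd.get_dummies(tokens)

# Step 3: keep only the wanted words; words never seen become all-zero columns.
wanted = dummies.reindex(columns=words, fill_value=0).astype(int)

# Step 4: sum the indicators per original row and attach them to the text.
counts = wanted.groupby(level=0).sum()
result = df1[["coments"]].join(counts)
print(result)
```

Because the tokens are compared as whole strings, "comment" is not counted as "comments" and "this" does not contribute to "is", which plain substring counting with str.count would do.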
You can use str.extractall to extract only the words that are in the word list, then use value_counts:
pattern = '|'.join(word_list)
(df.comments.str.extractall(rf'\b({pattern})\b')[0]
   .groupby(level=0).value_counts()
   .unstack(fill_value=0)
   .reindex(word_list, axis=1, fill_value=0)
)
Output (note that this also has a column named comments, just like the original DataFrame):
0  hello  this  is  the  comments  blah
0      0     1   1    1         1     0
1      0     1   2    1         0     0
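Getting these counts back into the original DataFrame takes two more details, sketched here under some assumptions. extractall drops any row that contains none of the listed words, so the row index is restored with zeros, and the clash between the text column comments and the count column comments is resolved with a suffix. The sample DataFrame, the extra third row, and the "_count" suffix are illustrative choices, and re.escape is an added safeguard that the answer itself does not use.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "comments": [
        "this is the 1st comments here",
        "the 2nd comment is is here this",
        "nothing relevant",  # hypothetical extra row with no listed word
    ]
})
word_list = ["hello", "this", "is", "the", "comments", "blah"]

# re.escape guards against words that contain regex metacharacters.
pattern = "|".join(map(re.escape, word_list))

counts = (
    df["comments"].str.extractall(rf"\b({pattern})\b")[0]
    .groupby(level=0).value_counts()
    .unstack(fill_value=0)
    .reindex(columns=word_list, fill_value=0)
    # extractall drops rows without any match, so restore them as zeros.
    .reindex(df.index, fill_value=0)
)

# rsuffix only renames the overlapping column, so the count of the word
# "comments" becomes "comments_count" and the text column is untouched.
result = df.join(counts, rsuffix="_count")
print(result)
```

If the DataFrame already contains empty feature columns for each word, as in the question, one option is to drop them before the join or to assign the counts into them with df[word_list] = counts[word_list].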