Unexpected output when creating a list of frequent words. How can I get the top 10 most frequent words for a given class?

Question

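I am trying to get the top 10 most frequent words for each class in my dataset. I have the Python code below, but I don't understand the output, why it happens, or how to correct it.

Below is the dataset I am using (df):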

   User            Post                                                Label
0  Nicholas Wyman  Exploring in this months Talent Management HR...   Recruitment
1  Nicholas Wyman  I count myself fortunate to have spent time wi...  Career
2  Nicholas Wyman  This years National Apprenticeship Week comes ...  Recruitment
3  Nicholas Wyman  How will your company tap into workers as a co...   Wellbeing
4  Nicholas Wyman  The momentum for Modern Apprenticeships is bui...   Recruitment

This is the code I am using:

#Import dataset
df = pd.read_csv("Folds1345.csv", engine='python',encoding='latin-1')

#Get classes
classes = df['Label'].unique()
classes = classes.tolist()

#Check each class and produce top 10 words
for i in classes:
  print(i)
  df2=df.loc[df['Label'] == i, 'Post']
  df2 = str(remove_stopwords(df['Post']))
  from collections import Counter
  Frequent = Counter(" ".join(df2).split()).most_common(10)
  print(Frequent)

This is the output:

Recruitment
[("'", 1213698), (',', 606859), ('e', 507474), ('i', 321003), ('a', 311593), ('n', 303956), ('t', 296568), ('s', 290978), ('r', 276802), ('o', 261175)]
Career
[("'", 1213698), (',', 606859), ('e', 507474), ('i', 321003), ('a', 311593), ('n', 303956), ('t', 296568), ('s', 290978), ('r', 276802), ('o', 261175)]
Wellbeing
[("'", 1213698), (',', 606859), ('e', 507474), ('i', 321003), ('a', 311593), ('n', 303956), ('t', 296568), ('s', 290978), ('r', 276802), ('o', 261175)]
Rewards
[("'", 1213698), (',', 606859), ('e', 507474), ('i', 321003), ('a', 311593), ('n', 303956), ('t', 296568), ('s', 290978), ('r', 276802), ('o', 261175)]
Technology
[("'", 1213698), (',', 606859), ('e', 507474), ('i', 321003), ('a', 311593), ('n', 303956), ('t', 296568), ('s', 290978), ('r', 276802), ('o', 261175)]
Learning
[("'", 1213698), (',', 606859), ('e', 507474), ('i', 321003), ('a', 311593), ('n', 303956), ('t', 296568), ('s', 290978), ('r', 276802), ('o', 261175)]
HR System
[("'", 1213698), (',', 606859), ('e', 507474), ('i', 321003), ('a', 311593), ('n', 303956), ('t', 296568), ('s', 290978), ('r', 276802), ('o', 261175)]
Inclusion
[("'", 1213698), (',', 606859), ('e', 507474), ('i', 321003), ('a', 311593), ('n', 303956), ('t', 296568), ('s', 290978), ('r', 276802), ('o', 261175)]
Diversity
[("'", 1213698), (',', 606859), ('e', 507474), ('i', 321003), ('a', 311593), ('n', 303956), ('t', 296568), ('s', 290978), ('r', 276802), ('o', 261175)]

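It seems to be counting individual letters rather than words, and searching the whole dataset rather than only the posts with the selected label, but I don't understand why.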

Answer 1

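import pandas as pd
from collections import Counter
# Assumption: remove_stopwords comes from gensim (the question does not say)
from gensim.parsing.preprocessing import remove_stopwords

# Import dataset
df = pd.read_csv("Folds1345.csv", engine='python', encoding='latin-1')

# Get the distinct class labels
classes = df['Label'].unique().tolist()

for i in classes:
  print(i)
  # Keep only the posts that belong to the current class
  df2 = df.loc[df['Label'] == i, 'Post']
  # Remove stopwords post by post; str() guards against non-string cells
  df2 = df2.apply(lambda x: remove_stopwords(str(x)))
  # Join all posts into one text, then split on any whitespace
  list_words = ' '.join(df2.to_list()).split()
  Frequent = Counter(list_words).most_common(10)
  print(Frequent)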

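Edit: in your code, df2 is first a pandas Series and is then overwritten with a string, and that string is built from the whole df['Post'] column rather than from the filtered posts. Calling " ".join(df2) on a string puts a space between every character, so the Counter ends up counting single characters, and the result is the same for every class. I'm not sure which remove_stopwords function you are using; I assume it is the one from gensim. I have modified the code accordingly.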
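To illustrate the difference, here is a minimal sketch using only the standard library. The two sample posts are made up for the demonstration and are not taken from the dataset. It contrasts joining a string (which iterates over characters) with joining a list of strings (which iterates over whole posts).

```python
from collections import Counter

# Made-up sample posts, standing in for the filtered 'Post' column
posts = ["talent management matters", "talent is everywhere"]

# str() collapses the whole collection into ONE string (its printed form),
# including the brackets, quotes and commas
as_string = str(posts)

# " ".join() over a string walks it character by character,
# so split() returns single characters
chars = " ".join(as_string).split()
char_counts = Counter(chars)

# " ".join() over a list of strings keeps each post intact,
# so split() returns words
words = " ".join(posts).split()
word_counts = Counter(words)

print(char_counts.most_common(3))
print(word_counts.most_common(1))
```

This is also why quote and comma characters show up at the top of the counts in the question's output: they come from the printed representation of the collection, not from the posts themselves.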

EDIT2: this should work now.
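If you want to check the per-class logic without the CSV file or gensim, the following sketch may help. It is not the code above: it uses a hypothetical helper `top_words_per_label`, a tiny made-up stopword set instead of gensim's list, and invented rows, and it accumulates one Counter per label in a single pass.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for a real stopword list (e.g. gensim's)
STOPWORDS = {"the", "is", "a", "for", "in", "to", "and"}

def top_words_per_label(rows, n=10):
    """Return {label: [(word, count), ...]} with the n most common words."""
    counts = defaultdict(Counter)
    for label, post in rows:
        # Lowercase and split each post on whitespace
        tokens = str(post).lower().split()
        # Only count tokens that are not stopwords, under this row's label
        counts[label].update(t for t in tokens if t not in STOPWORDS)
    return {label: c.most_common(n) for label, c in counts.items()}

# Invented (label, post) pairs for illustration
rows = [
    ("Recruitment", "Apprenticeship week is the week for hiring"),
    ("Career", "A career in hiring"),
    ("Recruitment", "Hiring apprentices"),
]

result = top_words_per_label(rows, n=2)
print(result)
```

With a DataFrame like the one in the question, the rows could be supplied as `zip(df['Label'], df['Post'])`.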
