How do I get the most frequent words in a column of text based on the value of another column?

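I have a dataset of tweets and the year each one was posted. I want to count the most frequent words for each year. My dataset looks like this: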

year     tweet
2015     my car is blue
2015     mom is making dinner
2016     my hair is red
2016     i love my mom

I only know how to get the most frequent words across the whole dataset:

pd.Series(' '.join(df['tweet']).split()).value_counts()

That gives me this:

my      3
is      3
mom     2
car     1
blue    1
making  1
dinner  1
hair    1
red     1
i       1
love    1

So how can I get something like this?

2015

is      2
my      1
car     1
blue    1
mom     1
making  1
dinner  1

2016

my      2
hair    1
is      1
red     1
i       1
love    1
mom     1

I would do it like this:

counts = df.set_index('year')['tweet'].str.split().explode().groupby(level=0).apply(lambda s: s.value_counts())

Output:

>>> counts
year        
2015  is        2
      my        1
      car       1
      blue      1
      mom       1
      making    1
      dinner    1
2016  my        2
      hair      1
      is        1
      red       1
      i         1
      love      1
      mom       1
Name: tweet, dtype: int64
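
For readers who want to see what each link in that chain does, here is a minimal sketch that rebuilds the sample data from the question and runs the same steps one at a time. The intermediate variable names (`tweets`, `word_lists`, `words`) are only for illustration and are not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "year": [2015, 2015, 2016, 2016],
    "tweet": [
        "my car is blue",
        "mom is making dinner",
        "my hair is red",
        "i love my mom",
    ],
})

# 1. Make the year the index so it stays attached to each tweet.
tweets = df.set_index("year")["tweet"]

# 2. Split each tweet on whitespace: one list of words per row.
word_lists = tweets.str.split()

# 3. Explode the lists: one word per row, with the year index repeated.
words = word_lists.explode()

# 4. Group by the index (the year) and count the words inside each group.
counts = words.groupby(level=0).apply(lambda s: s.value_counts())

print(counts)
```

Note that `str.split()` only splits on whitespace, so words that differ in case or carry punctuation would be counted as separate entries unless the text is normalized first.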

To get only the top N words per year, say the top 5:

df.set_index('year')['tweet'].str.split().explode().groupby(level=0).apply(lambda x: x.value_counts().head(5))
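
If the cutoff needs to vary, the same one-liner can be wrapped in a small function. This is only a sketch: the function name `top_words_per_year` and its parameters are hypothetical, chosen here for illustration, and the sample data is the one from the question.

```python
import pandas as pd


def top_words_per_year(df, n=5, year_col="year", text_col="tweet"):
    """Return the n most frequent words for each year (hypothetical helper)."""
    # One word per row, indexed by year.
    words = df.set_index(year_col)[text_col].str.split().explode()
    # Count words within each year and keep only the first n per group.
    return words.groupby(level=0).apply(lambda s: s.value_counts().head(n))


# Sample data copied from the question.
df = pd.DataFrame({
    "year": [2015, 2015, 2016, 2016],
    "tweet": [
        "my car is blue",
        "mom is making dinner",
        "my hair is red",
        "i love my mom",
    ],
})

top2 = top_words_per_year(df, n=2)
print(top2)
```

Keep in mind that `head(n)` simply cuts the sorted counts after n rows, so when several words are tied at the cutoff, which of them survives should be treated as arbitrary.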