How do I get the most frequent words in a column of text based on the value of another column?
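I have a dataset of tweets and the year each one was posted. I want to count the most frequent words for each year. My dataset looks like this: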
year tweet
2015 my car is blue
2015 mom is making dinner
2016 my hair is red
2016 i love my mom
I only know how to get the most frequent words across the whole dataset:
pd.Series(' '.join(df['tweet']).split()).value_counts()
Which gives me this:
my 3
is 3
mom 2
car 1
blue 1
making 1
dinner 1
hair 1
red 1
i 1
love 1
So how can I get something like this instead?
2015
is 2
my 1
car 1
blue 1
mom 1
making 1
dinner 1
2016
my 2
hair 1
is 1
red 1
i 1
love 1
mom 1
I would do it like this:
counts = df.set_index('year')['tweet'].str.split().explode().groupby(level=0).apply(pd.value_counts)
Output:
>>> counts
year
2015 is 2
my 1
car 1
blue 1
mom 1
making 1
dinner 1
2016 my 2
hair 1
is 1
red 1
i 1
love 1
mom 1
Name: tweet, dtype: int64
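To make the chain easier to follow, here is a minimal sketch that runs it one step at a time on the sample data from the question. The intermediate names (`words`, `exploded`) are only illustrative. It uses `lambda s: s.value_counts()` in place of the top-level `pd.value_counts`, on the assumption that you are on a recent pandas release, where the top-level function is deprecated and emits a warning.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "year": [2015, 2015, 2016, 2016],
    "tweet": [
        "my car is blue",
        "mom is making dinner",
        "my hair is red",
        "i love my mom",
    ],
})

# Step 1: index by year and split each tweet into a list of words
words = df.set_index("year")["tweet"].str.split()

# Step 2: one row per word; the year index value is repeated for each word
exploded = words.explode()

# Step 3: count the words separately within each year
counts = exploded.groupby(level=0).apply(lambda s: s.value_counts())

print(counts)
```

The result is a Series with a two-level index (year, word), so a single count can be looked up with something like `counts.loc[(2015, "is")]`.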
To get the top N words per year, say the top 5:
df.set_index('year')['tweet'].str.split().explode().groupby(level=0).apply(lambda x: x.value_counts().head(5))
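As a possible alternative, the same top-N result can be sketched without `apply`, using the groupby's own `value_counts` followed by `head`. This is a variation, not the method above, and the helper name `top_words` is hypothetical. It relies on the grouped `value_counts` sorting counts in descending order within each year, which is its default; the order among words with equal counts is not something to depend on.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "year": [2015, 2015, 2016, 2016],
    "tweet": [
        "my car is blue",
        "mom is making dinner",
        "my hair is red",
        "i love my mom",
    ],
})


def top_words(frame, n):
    """Return the n most frequent words per year (hypothetical helper)."""
    per_year = (
        frame.set_index("year")["tweet"]
        .str.split()
        .explode()
        .groupby(level=0)
        .value_counts()  # counts sorted descending within each year
    )
    # Keep the first n rows of each year, preserving the sorted order
    return per_year.groupby(level=0).head(n)


print(top_words(df, 5))
```

Calling `top_words(df, 1)` keeps only the single most frequent word for each year.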