Compute n-grams by category column with pandas
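I'm trying to find the most frequent n-grams in a pandas column in Python. I managed to put together the following code, which does exactly that.
However, I would like the results split by the "category" column. Instead of a row with bi-gram|total frequency, like
"blue orange"|1
I want three columns, bi-gram|frequency fruit|frequency meat, like
"blue orange"|1|0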
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

data = {'text': ['blue orange is tired', 'an apple', 'meat are great for my stomach'],
        'category': ['fruit', 'fruit', 'meat']}
df = pd.DataFrame(data)

# Count every word-level bi-gram and tri-gram in the 'text' column
word_vectorizer = CountVectorizer(ngram_range=(2, 3), analyzer='word')
sparse_matrix = word_vectorizer.fit_transform(df['text'])

# Sum the per-document counts to get one total per n-gram
frequencies = sum(sparse_matrix).toarray()[0]
df_ngrams = pd.DataFrame(frequencies, index=word_vectorizer.get_feature_names_out(), columns=['frequency'])
df_ngrams.sort_values('frequency', ascending=False).head(50)
Refactor your code into a function that you can apply to each group:
def compute_ngram_freq(df):
    # Same counting logic as in the question, run on one category's rows
    word_vectorizer = CountVectorizer(ngram_range=(2, 3), analyzer='word')
    sparse_matrix = word_vectorizer.fit_transform(df['text'])
    frequencies = sum(sparse_matrix).toarray()[0]
    df_ngrams = pd.DataFrame(frequencies, index=word_vectorizer.get_feature_names_out(), columns=['frequency'])
    return df_ngrams.sort_values('frequency', ascending=False)

# Apply per category, then pivot the category level into columns (missing n-grams become 0)
out = df.groupby('category').apply(compute_ngram_freq).unstack(level=0, fill_value=0)
Output:
                frequency
category            fruit meat
an apple                1    0
are great               0    1
are great for           0    1
blue orange             1    0
blue orange is          1    0
for my                  0    1
for my stomach          0    1
great for               0    1
great for my            0    1
is tired                1    0
meat are                0    1
meat are great          0    1
my stomach              0    1
orange is               1    0
orange is tired         1    0