How do I do a weighted unigram/bigram/trigram using CountVectorizer with weights (from a column value) instead of counts?
My dataset contains a column of text and a column with aggregated counts; it looks like this:
text, count (column name)
this is my home,100
where am i,10
this is a piece of cake, 2
The code I found on the internet to build the unigrams:
from sklearn.feature_extraction.text import CountVectorizer

def get_top_n_words(corpus, n=None):
    vec = CountVectorizer().fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_words(df['text'], 20)
With the standard CountVectorizer, I get unigram counts like this:
this 2
is 2
my 1
where 1
am 1
i 1
a 1
piece 1
of 1
cake 1
I would like them weighted by the count column, since it holds aggregated counts, i.e.:
this 102
is 102
my 100
where 10
am 10
i 10
a 2
piece 2
of 2
cake 2
Is this possible?
What you can do is call the toarray method after transform, so that you can then do a matrix multiplication with the count values:
from sklearn.feature_extraction.text import CountVectorizer

def get_top_n_words(corpus, count, n=None):  # add a parameter for the count values
    vec = CountVectorizer().fit(corpus)
    # multiply the dense array from transform by the count values,
    # broadcasting the per-row weights across all columns
    bag_of_words = vec.transform(corpus).toarray() * count.values[:, None]
    sum_words = bag_of_words.sum(axis=0)
    # sum_words is now a 1-D array, so it is indexed by idx alone
    words_freq = [(word, sum_words[idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_words(df['text'], df['count'], 20)
print(common_words)
[('this', 102),
('is', 102),
('my', 100),
('home', 100),
('where', 10),
('am', 10),
('piece', 2),
('of', 2),
('cake', 2)]
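Since the question asks about bigrams and trigrams as well, here is a sketch of the same weighting idea with `ngram_range=(1, 3)`. It also keeps the matrix sparse by multiplying with a diagonal weight matrix (`scipy.sparse.diags`) instead of calling `toarray()`, which avoids densifying on larger corpora. The function name and structure here are my own, not from the answer above:

```python
import pandas as pd
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

def get_top_n_ngrams(corpus, count, n=None, ngram_range=(1, 3)):
    vec = CountVectorizer(ngram_range=ngram_range).fit(corpus)
    bag = vec.transform(corpus)               # sparse (docs x ngrams) matrix
    # diag(count) @ bag scales each row by its weight and stays sparse
    weighted = sp.diags(count.to_numpy()) @ bag
    sums = weighted.sum(axis=0).A1            # flatten np.matrix to 1-D array
    freqs = [(ngram, sums[idx]) for ngram, idx in vec.vocabulary_.items()]
    return sorted(freqs, key=lambda x: x[1], reverse=True)[:n]

df = pd.DataFrame({
    "text": ["this is my home", "where am i", "this is a piece of cake"],
    "count": [100, 10, 2],
})
print(get_top_n_ngrams(df["text"], df["count"], n=5))
```

With this corpus, "this", "is", and "this is" all come out at 102 (100 + 2), followed by a tie of 100-weighted n-grams from the first row. Note that the default token pattern still drops single-character tokens such as "a" and "i", which is why they are missing from the output above as well.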