Get CountVectorizer vocabulary in a new dataframe column by applying the vectorizer to an existing dataframe column using pandas

Question

My dataframe has a column 'review' with content such as 'Food was Awesome', and I want a new column that counts how many times each word occurs.

name      The First Years Massaging Action Teether
review                    A favorite in our house!
rating                                           5
Name: 269, dtype: object

The expected output is something like ['Food':1,'was':1,'Awesome':1]. I tried using a for loop, but it takes too long to run.

for row in range(products.shape[0]):
    try:
        count_vect.fit_transform([products['review_without_punctuation'][row]])
        products['word_count'][row] = count_vect.vocabulary_
    except:
        print(row)

I want to do this without a for loop.

Answer 1

You can get the count vectors for all documents like this:

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
count_vectors = cv.fit_transform(products['review_without_punctuation'])

To get the count vector for a specific document by index in array format, for example the first document:

count_vectors[0].toarray()

The vocabulary is in

cv.vocabulary_

To get the words that make up a count vector, for example for the first document, use

cv.inverse_transform(count_vectors[0])

Answer 2

I found a solution. I defined a function like this:

def Vectorize(text):
    try:
        count_vect.fit_transform([text])
        return count_vect.vocabulary_
    except:
        return -1

and applied it like this:

from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
products['word_count'] = products['review_without_punctuation'].apply(Vectorize)

This solution worked, and I got the vocabulary in the new column.
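
One caveat worth keeping in mind: `vocabulary_` maps each word to its column index in the feature matrix, not to the number of times it occurs. If actual per-review counts are wanted, a possible alternative is to reuse the vectorizer's own tokenizer through `build_analyzer()` and count tokens with `collections.Counter`. This is only a sketch; the sample frame and the `count_words` helper are hypothetical names, not part of the original solution.

```python
from collections import Counter

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample data; the empty review shows the no-token case.
products = pd.DataFrame(
    {"review_without_punctuation": ["Food was Awesome", "food food was cold", ""]}
)

# build_analyzer() returns the preprocessing + tokenizing callable
# (lowercasing, default token pattern); no fitting is required.
analyze = CountVectorizer().build_analyzer()

def count_words(text):
    """Return a {word: count} dict for one review string."""
    return dict(Counter(analyze(text)))

products["word_count"] = products["review_without_punctuation"].apply(count_words)
print(products["word_count"].tolist())
```

Because nothing is fitted per row, an empty review simply yields an empty dict instead of raising, so the try/except is not needed here. Missing values (NaN) would still need to be handled before calling `apply`.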


pandas

scikit-learn

countvectorizer