How to get a column sum in the matrix returned by sklearn CountVectorizer?

How can I get the sum of any given column in the term-frequency matrix returned by sklearn's CountVectorizer?

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

corpus = [ 'This is a sentence',
           'Another sentence is here',
           'Wait for another sentence',
           'The sentence is coming',
           'The sentence has come'
         ]

x = vectorizer.fit_transform(corpus)

For example, I want to find the frequency of "sentence" in the matrix, so I need the sum of the sentence column. I can't figure out how to do this.

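You can try the following:

  1. Get the position of your term in the CountVectorizer's list of feature names (get_feature_names_out() in current scikit-learn releases, get_feature_names() in older ones).
  2. Use that position to select the matching column of the CSR matrix (x, in your case) and sum it.

Code: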

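term_to_sum = 'sentence'

# get_feature_names_out() returns a NumPy array, which has no .index(),
# so convert it to a list first. Older scikit-learn releases used
# get_feature_names(), which has since been removed.
index_term = list(vectorizer.get_feature_names_out()).index(term_to_sum)

# Select the column for the term and sum its counts over all documents.
s = x[:, index_term].sum()  # 5: 'sentence' occurs once in each of the five documents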
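
If you need totals for more than one term, a variation on the same idea is to sum every column at once and look terms up by name. The snippet below is only a sketch of that approach: it uses the fitted vectorizer's vocabulary_ mapping (term to column index) instead of searching the feature-name list, and the names sentence_total, totals and term_totals are just illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['This is a sentence',
          'Another sentence is here',
          'Wait for another sentence',
          'The sentence is coming',
          'The sentence has come']

vectorizer = CountVectorizer()
x = vectorizer.fit_transform(corpus)

# vocabulary_ maps each (lowercased) term to its column index in x.
col = vectorizer.vocabulary_['sentence']
sentence_total = int(x[:, col].sum())

# Summing over axis 0 collapses the documents and leaves one total per column.
# np.asarray(...).ravel() flattens the result to a plain 1-D array.
totals = np.asarray(x.sum(axis=0)).ravel()

# Pair every term with its total count across the corpus.
term_totals = {term: int(totals[idx])
               for term, idx in vectorizer.vocabulary_.items()}

print(sentence_total)      # 5
print(term_totals['is'])   # 3
```

Note that CountVectorizer lowercases text and, with its default tokenizer, drops single-character tokens, so 'The' is counted under 'the' and 'a' does not appear in the vocabulary at all.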