Term relative frequency matrix from CountVectorizer

Question

Is there a way to get a relative frequency matrix starting from the absolute frequency matrix produced by CountVectorizer? This is the code I am using:

body = [
    'the quick brown fox',
    'the slow brown dog',
    'the quick red dog',
    'the lazy yellow fox'
]

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words='english')
bag_of_words = vectorizer.fit_transform(body)

from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=2)
lsa = svd.fit_transform(bag_of_words)

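My goal is to call fit_transform() (the last line of my code) on a relative frequency matrix instead of the absolute frequency matrix. Specifically, I want to divide each row of the matrix bag_of_words by the sum of that row. This is not straightforward for me because the matrix is sparse.

Any suggestions would be appreciated. Thanks.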

Answer 1

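This can be done with TfidfVectorizer instead of CountVectorizer. It requires changing two default parameters, though:

Drop the "idf" part of the tf-idf vectorizer so that only term frequencies remain (use_idf=False).
By default the counts are normalized by the L2 norm; what you want here (normalizing by the sum of all counts in the row) is the L1 norm (norm="l1").

In practice, it looks like this: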

from sklearn.feature_extraction.text import TfidfVectorizer
body = [
    'the quick brown fox',
    'the slow brown dog',
    'the quick red dog',
    'the lazy yellow fox'
]
vectorizer = TfidfVectorizer(use_idf=False, norm="l1")
X = vectorizer.fit_transform(body)
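# Show the dense relative-frequency matrix
print(repr(X.toarray()))
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out().tolist())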

This returns:

array([[0.25, 0.  , 0.25, 0.  , 0.25, 0.  , 0.  , 0.25, 0.  ],
       [0.25, 0.25, 0.  , 0.  , 0.  , 0.  , 0.25, 0.25, 0.  ],
       [0.  , 0.25, 0.  , 0.  , 0.25, 0.25, 0.  , 0.25, 0.  ],
       [0.  , 0.  , 0.25, 0.25, 0.  , 0.  , 0.  , 0.25, 0.25]])

['brown', 'dog', 'fox', 'lazy', 'quick', 'red', 'slow', 'the', 'yellow']
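
If you would rather keep the CountVectorizer output from the question and normalize it afterwards, one possible route (a sketch, not part of the answer above) is sklearn.preprocessing.normalize with norm='l1'. It accepts sparse input and keeps the result sparse. Since the counts are non-negative, the L1 norm of a row is just its sum. The names relative_freq, row_sums and same_as_tfidf are made up for this illustration, and random_state=0 is only there to make the run repeatable.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import normalize

body = [
    'the quick brown fox',
    'the slow brown dog',
    'the quick red dog',
    'the lazy yellow fox'
]

# Absolute counts, exactly as in the question (sparse matrix)
vectorizer = CountVectorizer(stop_words='english')
bag_of_words = vectorizer.fit_transform(body)

# Divide each row by its L1 norm; for non-negative counts this is the row sum.
# The result stays sparse, and all-zero rows are left as zeros.
relative_freq = normalize(bag_of_words, norm='l1', axis=1)

# Sanity check: every row should now sum to 1
row_sums = np.asarray(relative_freq.sum(axis=1)).ravel()

# Cross-check against the TfidfVectorizer settings from the answer,
# with the same stop word setting as the question
tfidf = TfidfVectorizer(use_idf=False, norm='l1', stop_words='english')
same_as_tfidf = np.allclose(relative_freq.toarray(),
                            tfidf.fit_transform(body).toarray())

# Feed the relative frequencies to TruncatedSVD, as the question intended
svd = TruncatedSVD(n_components=2, random_state=0)
lsa = svd.fit_transform(relative_freq)
```

Note that the question passes stop_words='english' while the TfidfVectorizer example above does not, which is why 'the' shows up as a feature in that output.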

python

scipy

scikit-learn

countvectorizer