How to apply weights to sentences in CountVectorizer (count each sentence's tokens several times)

Question

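I'm using CountVectorizer to build a sparse matrix representation of a co-occurrence matrix.

I have a list of sentences and a second list (vector) of "weights": the number of times I want each sentence's tokens to be counted.

I could build a list in which each sentence is repeated as many times as its weight, but that is very inefficient and un-Pythonic. Some of my weights are in the millions or more.

How can I efficiently tell CountVectorizer to use my vector of weights?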

Answer 1

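Since there is no way (that I could find) to apply a weight to each sentence passed to CountVectorizer, you can instead multiply the resulting sparse matrix by the weights.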

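import scipy.sparse
from sklearn.feature_extraction.text import CountVectorizer

# all_strings: list of sentences; df.weight: one count per sentence; space_splitter: custom tokenizer
cv = CountVectorizer(lowercase=False, min_df=0.001, tokenizer=space_splitter)
X = cv.fit_transform(all_strings)

# Put each sentence's weight (count) on the diagonal of a sparse matrix.
counts = scipy.sparse.diags(df.weight, 0)
# Scale each row of X by its weight; equivalent to (X.T * counts).T.
Xw = counts @ X
# Co-occurrence matrix: X.T @ Xw applies each weight once (Xw.T @ Xw would square the weights).
Xc = X.T @ Xw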

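Note that the matrix you multiply by must itself be sparse, with the weights on its diagonal. A dense diagonal matrix with one row per sentence would be far too large to hold in memory for a big corpus.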

python

nlp

scikit-learn

countvectorizer