CountVectorizer: transform method returns multidimensional array on a single text line

Question

First, I fit it on an SMS corpus:

from sklearn.feature_extraction.text import CountVectorizer
clf = CountVectorizer()
X_desc = clf.fit_transform(X).toarray()

It seems to work fine:

X.shape = (5574,)
X_desc.shape = (5574, 8713)

But then I applied the transform method to a single line of text. As far as we know, the result should have shape (, 8713), but this is what we see:

str2 = 'Have you visited the last lecture on physics?'
print len(str2), clf.transform(str2).toarray().shape

52 (52, 8713)

What is going on here? One more thing: all the numbers are zeros.

Answer 1

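You always need to pass an array or a vector to transform; if you only want to transform a single element, pass a singleton array and then extract its contents: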

clf.transform([str2])[0]
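
For a fuller picture, here is a minimal sketch of the same idea on a tiny made-up corpus. The corpus and the name `vectorizer` are hypothetical stand-ins for the asker's SMS data and `clf`; only the list wrapping around the single string is the point.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the SMS data (hypothetical, not the asker's dataset).
corpus = [
    "have you visited the lecture",
    "the last lecture on physics",
    "call me later",
]

vectorizer = CountVectorizer()
vectorizer.fit(corpus)

str2 = "Have you visited the last lecture on physics?"

# Wrap the single string in a list so it is treated as ONE document,
# not as an iterable of one-character documents.
row = vectorizer.transform([str2])

n_features = len(vectorizer.vocabulary_)
single_shape = row.shape      # one row, one column per vocabulary entry
counts = row.toarray()[0]     # dense 1-D vector of token counts

print(single_shape, counts.sum())
```

With the list wrapper, the result has exactly one row and the counts are no longer all zero, because whole words from the sentence are matched against the vocabulary.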

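By the way, the reason you get a two-dimensional array as output is that a string is effectively stored as a list of characters, so the vectorizer treats your string as an array in which each character is a separate document.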
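
This also likely explains the zeros. The sketch below uses only the standard library and assumes the vectorizer was built with the default `token_pattern`, `r"(?u)\b\w\w+\b"`, which only keeps tokens of two or more word characters. A one-character "document" can therefore never produce a token, so every row comes out empty. (As far as I know, recent scikit-learn releases refuse a bare string passed to `transform` and raise a `ValueError` instead of silently iterating over its characters.)

```python
import re

str2 = "Have you visited the last lecture on physics?"

# Iterating over a str yields one-character strings, so a bare string
# looks like a collection of len(str2) tiny documents.
pseudo_documents = list(str2)

# Default CountVectorizer token pattern: tokens of 2+ word characters.
default_token_pattern = re.compile(r"(?u)\b\w\w+\b")

# No single character can match a pattern that needs at least two.
tokens_per_char = [default_token_pattern.findall(ch.lower()) for ch in pseudo_documents]
all_empty = all(tokens == [] for tokens in tokens_per_char)

# The whole sentence, treated as one document, tokenizes as expected.
tokens_whole = default_token_pattern.findall(str2.lower())

print(len(pseudo_documents), all_empty)
print(tokens_whole)
```

So the number of rows equals the number of characters in the string, and each of those rows is all zeros.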
