fit_transform error using CountVectorizer

Question

So I have a dataframe X which looks like this:

X.head()

    0    My wife took me here on my birthday for breakf...
    1    I have no idea why some people give bad review...
    3    Rosie, Dakota, and I LOVE Chaparral Dog Park!!...
    4    General Manager Scott Petello is a good egg!!!...
    6    Drop what you're doing and drive here. After I...
    Name: text, dtype: object

And then:

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(X)

But I get this error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-61-8ff79b91e317> in <module>()
----> 1 X = cv.fit_transform(X)

~/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in fit_transform(self, raw_documents, y)
    867 
    868         vocabulary, X = self._count_vocab(raw_documents,
--> 869                                           self.fixed_vocabulary_)
    870 
    871         if self.binary:

~/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in _count_vocab(self, raw_documents, fixed_vocab)
    790         for doc in raw_documents:
    791             feature_counter = {}
--> 792             for feature in analyze(doc):
    793                 try:
    794                     feature_idx = vocabulary[feature]

~/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in <lambda>(doc)
    264 
    265             return lambda doc: self._word_ngrams(
--> 266                 tokenize(preprocess(self.decode(doc))), stop_words)
    267 
    268         else:

~/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in <lambda>(x)
    230 
    231         if self.lowercase:
--> 232             return lambda x: strip_accents(x.lower())
    233         else:
    234             return strip_accents

~/anaconda3/lib/python3.6/site-packages/scipy/sparse/base.py in __getattr__(self, attr)
    574             return self.getnnz()
    575         else:
--> 576             raise AttributeError(attr + " not found")
    577 
    578     def transpose(self, axes=None, copy=False):

AttributeError: lower not found

Not sure why.

Answer 1

You need to specify the column name of the text data, even if the dataframe has only one column.

X_countMatrix = cv.fit_transform(X['text'])

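CountVectorizer expects an iterable of documents as input, and when you pass a dataframe as the argument, the only thing that gets iterated is the column names. So even if no error had been raised, the result would have been wrong. Luckily, you got an error and had a chance to correct it.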
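To make the point above concrete, here is a minimal sketch. The dataframe `df` and its three short strings are made up for illustration and only assume a single column named `text`, as in the question. It compares passing the whole dataframe with passing the column.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for the question's data: one column named "text".
df = pd.DataFrame({"text": ["good dog park", "bad review", "good egg"]})

# Iterating a DataFrame yields its column labels, not its rows.
print(list(df))

# Passing the whole DataFrame raises no error here, but the vectorizer
# only ever sees the column label as a single "document".
wrong_cv = CountVectorizer()
wrong = wrong_cv.fit_transform(df)
print(wrong.shape, wrong_cv.vocabulary_)

# Passing the column gives one row per document.
cv = CountVectorizer()
counts = cv.fit_transform(df["text"])
print(counts.shape)
```

Note that the result is stored under a new name instead of overwriting the input. The last frame of the traceback in the question is in scipy/sparse/base.py, which suggests that X may already have been a sparse matrix when the call ran, for example because the cell `X = cv.fit_transform(X)` had been executed once before. Keeping the input and the output under separate names avoids that.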

