Error: 'int' object has no attribute 'lower' - with regards to CountVectorizer and Pandas

Question

I can't apply CountVectorizer to a dataset imported from Excel. I tried replacing every integer in the data with a string, but CountVectorizer still seems to see integers.

import numpy as np
import sklearn
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer as cv
from sklearn.linear_model import Perceptron
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split


pos = pd.read_excel("/content/drive/My Drive/Polarity_pos.xlsx", header = None, names=None)

neg = pos = pd.read_excel("/content/drive/My Drive/Polarity_neg.xlsx", header = None, names=None)


merged_train = pd.merge(pos,neg)


string = merged_train.astype('str')

train=pd.DataFrame(data=string).replace('\d+','NUM',regex=True)


print(train.loc[19,:])


#analyzer='word',stop_words=None,analyzer = 'word' 
vectorizer = cv()
count_vector = vectorizer.fit_transform(train)

This raises the following error:

AttributeError                            Traceback (most recent call last)
<ipython-input-116-adcd263d8e89> in <module>()
     26 #analyzer='word',stop_words=None,analyzer = 'word'
     27 vectorizer = cv()
---> 28 count_vector = vectorizer.fit_transform(train)
     29 
     30 

3 frames
/usr/local/lib/python3.6/dist-packages/sklearn/feature_extraction/text.py in _preprocess(doc, accent_function, lower)
     66     """
     67     if lower:
---> 68         doc = doc.lower()
     69     if accent_function is not None:
     70         doc = accent_function(doc)

AttributeError: 'int' object has no attribute 'lower'

Answer 1

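You are probably passing the wrong kind of input to CountVectorizer's fit_transform. It does not expect a DataFrame but "an iterable of raw text documents"; see the docs. So you could try flattening the DataFrame first and then applying the vectorizer. Just make sure that what you are doing still makes sense for your problem. Try this: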

count_vector = vectorizer.fit_transform(train.stack())

where train.stack() converts your DataFrame into a Series.
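To illustrate why the DataFrame trips up the vectorizer: iterating over a DataFrame yields its column labels rather than its cells, and with header=None those labels are the integers 0, 1, ..., which would explain the 'int' object in the traceback. The sketch below uses a small made-up DataFrame as a stand-in for the Excel files (the sample sentences are assumptions, not data from the question) and shows the flattening approach.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for the Excel data: no header row,
# so pandas labels the columns with the integers 0 and 1.
train = pd.DataFrame([["good movie", "seen 2 times"],
                      ["bad plot", "fell asleep"]])

# Iterating over a DataFrame yields column labels, not cell values.
labels = list(train)
print(labels)  # the integer column labels

# Flatten all cells into one Series and force every value to str,
# so the vectorizer receives an iterable of text documents.
docs = train.stack().astype(str)

vectorizer = CountVectorizer()
count_vector = vectorizer.fit_transform(docs)
print(count_vector.shape)
```

Note that stack() turns every cell into its own document. If each row is meant to be one document and the text sits in a single column, passing just that column (for example train[0].astype(str)) may be a better fit.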

python

pandas

scikit-learn

countvectorizer