Error: 'int' object has no attribute 'lower' - with regards to CountVectorizer and Pandas
I am unable to apply CountVectorizer to a dataset imported from Excel. I tried replacing all the integers in the data with a string, but CountVectorizer still sees integers.
import numpy as np
import sklearn
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer as cv
from sklearn.linear_model import Perceptron
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
pos = pd.read_excel("/content/drive/My Drive/Polarity_pos.xlsx", header = None, names=None)
neg = pos = pd.read_excel("/content/drive/My Drive/Polarity_neg.xlsx", header = None, names=None)
merged_train = pd.merge(pos,neg)
string = merged_train.astype('str')
train=pd.DataFrame(data=string).replace('\d+','NUM',regex=True)
print(train.loc[19,:])
#analyzer='word',stop_words=None,analyzer = 'word'
vectorizer = cv()
count_vector = vectorizer.fit_transform(train)
This raises the following error:
AttributeError Traceback (most recent call last)
<ipython-input-116-adcd263d8e89> in <module>()
26 #analyzer='word',stop_words=None,analyzer = 'word'
27 vectorizer = cv()
---> 28 count_vector = vectorizer.fit_transform(train)
29
30
3 frames
/usr/local/lib/python3.6/dist-packages/sklearn/feature_extraction/text.py in _preprocess(doc, accent_function, lower)
66 """
67 if lower:
---> 68 doc = doc.lower()
69 if accent_function is not None:
70 doc = accent_function(doc)
AttributeError: 'int' object has no attribute 'lower'
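You are probably passing the wrong input to CountVectorizer's fit_transform. It does not expect a DataFrame but "an iterable of raw text documents"; see the docs. Iterating over a DataFrame yields its column labels rather than its cells, and with header=None those labels are the integers 0, 1, ..., which is why the error still mentions an int after every cell has been converted to a string. So you can try flattening the DataFrame and then using the vectorizer, but make sure that what you do still fits your problem. Try this: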
count_vector = vectorizer.fit_transform(train.stack())
where train.stack() converts your DataFrame into a Series.
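To make the difference concrete, here is a minimal sketch. The small DataFrame is a made-up stand-in for the Excel data (the real files are not shown in the question), and the full name CountVectorizer is used instead of the cv alias. It shows what the vectorizer iterates over when given a DataFrame, and what it gets after stack():

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data standing in for the Excel sheets.
# No header row is given, so the columns are labelled 0 and 1.
train = pd.DataFrame([["good movie", "great plot"],
                      ["bad movie", "NUM stars"]])

# Iterating over a DataFrame yields its column labels, not its cells.
# These integer labels are what CountVectorizer tried to call .lower() on.
labels = list(train)

# stack() flattens the cells into a Series of strings, one entry per cell.
docs = train.stack()

vectorizer = CountVectorizer()
count_vector = vectorizer.fit_transform(docs)

# Sorted list of the tokens the vectorizer learned.
vocab = sorted(vectorizer.vocabulary_)
print(labels)
print(count_vector.shape)
print(vocab)
```

Note that stack() treats every cell as a separate document. If each row holds one document in a single column, passing just that column (for example train[0]) may fit the problem better.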