TypeError in CountVectorizer (scikit-learn): expected string or buffer
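I am working on a classification problem. When I feed my text to CountVectorizer, I get this error:
expected string or buffer.
Is something wrong with my dataset? It contains messages that mix numbers and words, and some messages also contain special characters.
Sample messages look like this: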
0 I have not received my gifts which I ordered ok
1 hth her wells idyll McGill kooky bbc.co
2 test test test 1 test
3 test
4 hello where is my reward points
5 hi, can you get koovs coupons or vouchers here...
This is the code I am using for the classification:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
df = pd.read_excel('training_data.xlsx')
X_train = df.message
print(X_train.shape)
map_class_label = {'checkin':0, 'greeting':1,'more reward options':2,'noclass':3, 'other':4,'points':5,
'referral points':6,'snapbill':7, 'thanks':8,'voucher not working':9,'voucher':10}
df['label_num'] = df['Final Category'].map(map_class_label)
y_train = df.label_num
vectorizer = CountVectorizer(lowercase=False,decode_error='ignore')
X_train_dtm = vectorizer.fit_transform(X_train)
You need to convert the message column to strings with astype, because the data contains some numeric values:
df = pd.read_excel('training_data.xlsx')
df['message'] = df['message'].values.astype('unicode')
...
...
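The fix above can be tried without the Excel file. The following is a minimal sketch, assuming a small in-memory DataFrame stands in for training_data.xlsx; the sample rows (including the numeric cell 12345) are made up for illustration, and Series.astype(str) is used here as an alternative spelling of the string conversion rather than the exact call from the answer.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for the spreadsheet: one cell is a number, not text.
df = pd.DataFrame({
    'message': ['hello where is my reward points', 12345, 'test test test 1 test'],
})

# Convert every cell to a string so the tokenizer only ever sees text.
df['message'] = df['message'].astype(str)

vectorizer = CountVectorizer(lowercase=False)
X_train_dtm = vectorizer.fit_transform(df['message'])

# One row per message; the numeric cell is now an ordinary token.
print(X_train_dtm.shape[0])
print('12345' in vectorizer.vocabulary_)
```

Without the conversion line, the numeric cell reaches the tokenizer as an integer, which is the situation the question describes.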
I got the same error by passing just a single string, like this:
cv.fit_transform('Making my way down,')
Instead, you have to pass a list of strings, like this:
cv.fit_transform(['Making my way down,', ])
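As a rough sketch of the point above: fit_transform expects an iterable of documents, so a lone string has to be wrapped in a list first. Here cv is assumed to be a CountVectorizer with default settings (the answer does not show how it was created), and as_documents is a hypothetical helper written only for this example, not part of scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer


def as_documents(text_or_docs):
    """Hypothetical helper: wrap a single string in a list, pass lists through."""
    if isinstance(text_or_docs, str):
        return [text_or_docs]
    return list(text_or_docs)


cv = CountVectorizer()  # assumed default settings

# A single string becomes a one-document corpus.
docs = as_documents('Making my way down,')
dtm = cv.fit_transform(docs)

print(dtm.shape)              # (1, 4): one document, four distinct tokens
print(sorted(cv.vocabulary_))  # ['down', 'making', 'my', 'way']
```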