AttributeError: 'list' object has no attribute 'lower' from Tfidf_vect.fit
I am trying to apply an SVM using tf-idf features, but I get this error:
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "C:\Program Files\JetBrains\PyCharm 2019.1.3\helpers\pydev\_pydev_bundle\pydev_umd.py", line 197, in runfile
pydev_imports.execfile(filename, global_vars, local_vars) # execute the script
File "C:\Program Files\JetBrains\PyCharm 2019.1.3\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "C:/Users/lam/.PyCharm2019.1/config/scratches/scratch_1.py", line 35, in <module>
Tfidf_vect.fit(data['input'])
File "C:\Users\lam\PycharmProjects\untitled\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 1631, in fit
X = super().fit_transform(raw_documents)
File "C:\Users\lam\PycharmProjects\untitled\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 1058, in fit_transform
self.fixed_vocabulary_)
File "C:\Users\lam\PycharmProjects\untitled\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 970, in _count_vocab
for feature in analyze(doc):
File "C:\Users\lam\PycharmProjects\untitled\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 352, in <lambda>
tokenize(preprocess(self.decode(doc))), stop_words)
File "C:\Users\lam\PycharmProjects\untitled\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 256, in <lambda>
return lambda x: strip_accents(x.lower())
AttributeError: 'list' object has no attribute 'lower'
Here is my code:
import nltk
import sklearn.model_selection
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

data['input'] = [nltk.word_tokenize(entry) for entry in data['input']]
Train_X, Test_X, Train_Y, Test_Y = sklearn.model_selection.train_test_split(data['input'], data['Class'], test_size=0.2)
Encoder = LabelEncoder()
Train_Y = Encoder.fit_transform(Train_Y)
Test_Y = Encoder.fit_transform(Test_Y)
Tfidf_vect = TfidfVectorizer()
Tfidf_vect.fit(data['input'])
Train_X_Tfidf = Tfidf_vect.transform(Train_X)
Test_X_Tfidf = Tfidf_vect.transform(Test_X)
print(Tfidf_vect.vocabulary_)
I am using Python 3.6.0, and my dataset is in Arabic.
Thanks.
The error indicates that TfidfVectorizer expects a string as its input, not a list of strings. It does all the tokenization itself (although you can plug a custom tokenizer into TfidfVectorizer if you need one).
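For example, a minimal sketch of the custom-tokenizer option (assuming data['input'] still holds raw strings, i.e. the word_tokenize step has not been applied to it first):

import nltk
from sklearn.feature_extraction.text import TfidfVectorizer

# Let TfidfVectorizer receive raw strings but delegate the token splitting to NLTK.
Tfidf_vect = TfidfVectorizer(tokenizer=nltk.word_tokenize)
Tfidf_vect.fit(data['input'])  # each entry must be one string, not a list of words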
So I would try a simpler pipeline, without the nltk.word_tokenize line. But I cannot be 100% sure, because you have not provided a sample of the actual input data that triggers the error.
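Concretely, a minimal sketch of that simpler pipeline: your snippet with the nltk.word_tokenize line removed (and, as a small side change, the test labels encoded with the encoder already fitted on the training labels):

import sklearn.model_selection
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

# data['input'] stays a column of raw strings -- no pre-tokenization.
Train_X, Test_X, Train_Y, Test_Y = sklearn.model_selection.train_test_split(
    data['input'], data['Class'], test_size=0.2)

Encoder = LabelEncoder()
Train_Y = Encoder.fit_transform(Train_Y)
Test_Y = Encoder.transform(Test_Y)  # reuse the mapping learned on the training labels

Tfidf_vect = TfidfVectorizer()
Tfidf_vect.fit(data['input'])       # now fits on strings, so .lower() works
Train_X_Tfidf = Tfidf_vect.transform(Train_X)
Test_X_Tfidf = Tfidf_vect.transform(Test_X)
print(Tfidf_vect.vocabulary_)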