Why am I getting an error saying that the test data has fewer features?
I am trying to fit a LinearSVC model on a dataset of 25,000 movie reviews; 12,500 of them are labelled positive and the rest negative. I am vectorizing the data with TfidfVectorizer.
Here is my code:
import os
import pandas as pd
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer
# stop_words was not defined in the snippet; sklearn's built-in English list is assumed here
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS as stop_words

total = []  # training reviews
l = []      # training labels
k = 0
# reading training data from the data-set and taking the test review as input
rev = input('Enter:')
rev = rev.replace("<br />", " ")
data_folder = "C:/Users/Files/Desktop/Dataset/train"
for sentiment in ["/neg/", "/pos/"]:
    # path is the directory holding the reviews for this sentiment
    path = data_folder + sentiment
    # filename is the name of each review file in that directory
    for filename in sorted(os.listdir(path)):
        # open the file and read the review
        with open(path + filename, "r", encoding='utf-8') as f:
            review = f.read()
        review = review.replace("<br />", " ")
        total.append(review)
# removing stop-words from the data (str.replace returns a new string, so keep the result)
for idx, review in enumerate(total):
    for j in stop_words:
        review = review.replace(j, '')
    total[idx] = review
for j in stop_words:
    rev = rev.replace(j, '')
c = TfidfVectorizer()
f = c.fit_transform(total).toarray()
tst = c.fit_transform([rev]).toarray()
# 0 for negative data and 1 for positive data
while k != 12500:
    l.append(0)
    k += 1
while k != 25000:
    l.append(1)
    k += 1
m = LinearSVC(random_state=0, tol=1e-5)
m.fit(f, l)
if m.predict(tst).tolist().count(1) > m.predict(tst).tolist().count(0):
    print('Positive')
else:
    print('Negative')
Every time I run this code, I get this error:
ValueError: X has 139 features per sample; expecting 79897
What does this error mean, and how do I fix it?
OK, I found the solution: use transform() instead of fit_transform() when vectorizing the test data. fit_transform() re-fits the vectorizer on the single test review and builds a new vocabulary of only 139 features, whereas the classifier was trained on the 79,897 features learned from the training corpus; transform() reuses the training vocabulary, so the feature counts match. That is:
c = TfidfVectorizer()
f = c.fit_transform(total).toarray()   # fit on the training reviews and learn the vocabulary
tst = c.transform([rev]).toarray()     # only transform the test review, reusing that vocabulary