Supervised machine learning with scikit-learn
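This is my first time doing supervised machine learning. It is a fairly advanced topic (at least for me), and I find it hard to pose a specific question because I am not sure what is going wrong.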
# Create a training list and test list (looks something like this):
train = [('this hostel was nice',2),('i hate this hostel',1)]
test = [('had a wonderful time',2),('terrible experience',1)]
# Loading modules
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics
# Use a BOW representation of the reviews
vectorizer = CountVectorizer(stop_words='english')
train_features = vectorizer.fit_transform([r[0] for r in train])
test_features = vectorizer.fit([r[0] for r in test])
# Fit a naive bayes model to the training data
nb = MultinomialNB()
nb.fit(train_features, [r[1] for r in train])
# Use the classifier to predict classification of test dataset
predictions = nb.predict(test_features)
actual=[r[1] for r in test]
At this point I get the error:
float() argument must be a string or a number, not 'CountVectorizer'
This confuses me, because the original ratings that I zipped together with the reviews are integers:
type(ratings_new[0])
int
You should change the line
test_features = vectorizer.fit([r[0] for r in test])
to:
test_features = vectorizer.transform([r[0] for r in test])
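The reason is that you have already fitted the vectorizer on the training data, so you do not need to fit it again on the test data. Instead, you only need to transform the test data with it. Note also that fit returns the vectorizer object itself rather than a feature matrix, which is why predict complained about being handed a 'CountVectorizer'.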
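For reference, here is a minimal sketch of the whole pipeline with that one-line change applied. It reuses the two-item toy lists from the question; the `probe` variable is a hypothetical name added only to show what `fit` returns, and the accuracy on a dataset this small says nothing about real performance.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# Toy data from the question: (review text, rating)
train = [('this hostel was nice', 2), ('i hate this hostel', 1)]
test = [('had a wonderful time', 2), ('terrible experience', 1)]

# Learn the vocabulary from the training reviews and vectorize them
vectorizer = CountVectorizer(stop_words='english')
train_features = vectorizer.fit_transform([r[0] for r in train])

# Reuse the already-fitted vocabulary for the test reviews:
# transform only, no second fit
test_features = vectorizer.transform([r[0] for r in test])

# Fit a naive Bayes model on the training data
nb = MultinomialNB()
nb.fit(train_features, [r[1] for r in train])

# Predict the test ratings and compare them with the true ones
predictions = nb.predict(test_features)
actual = [r[1] for r in test]
accuracy = metrics.accuracy_score(actual, predictions)
print(predictions, accuracy)

# Illustration of the original bug (hypothetical variable name):
# fit() returns the vectorizer itself, not a matrix of counts
probe = CountVectorizer().fit(['hello world'])
print(type(probe).__name__)
```

Because both matrices come from the same fitted vectorizer, they have the same number of columns, which is what the classifier expects at prediction time.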