calibrated classifier ValueError: could not convert string to float
Dataframe:
id  review                                        name    label
1   it is a great product for turning lights on.  Ashley
2   plays music and have a good sound.            Alex
3   I love it, lots of fun.                       Peter
I want to use a probabilistic classifier (linear_svc) to predict the label (the probability of label 1) from the review. My code:
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn import datasets
# Load dataset
X = training['review']
y = training['label']
linear_svc = LinearSVC()  # the base estimator
# Calibrated wrapper that adds probability estimates to the base estimator
calibrated_svc = CalibratedClassifierCV(linear_svc,
                                        method='sigmoid',  # sigmoid uses Platt scaling; see the documentation for other methods
                                        cv=3)
calibrated_svc.fit(X, y)
# predict
prediction_data = predict_data['review']
predicted_probs = calibrated_svc.predict_proba(prediction_data)
It gives the following error on calibrated_svc.fit(X, y):
ValueError: could not convert string to float: 'it is a great product for turning...'
Thanks a lot for your help.
Try this:
from sklearn.feature_extraction.text import TfidfVectorizer
X = training['review']
y = training['label']
prediction_data = predict_data['review']
tfv = TfidfVectorizer(min_df=1, stop_words='english')
# Learn the vocabulary from the training and prediction reviews together
tfv.fit(list(X) + list(prediction_data))
X = tfv.transform(X)
prediction_data = tfv.transform(prediction_data)
Then build the model:
linear_svc = LinearSVC()
calibrated_svc = CalibratedClassifierCV(linear_svc, method='sigmoid', cv=3)
calibrated_svc.fit(X, y)
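To see why the vectorizing step removes the error, it can help to look at what TfidfVectorizer returns. The sketch below reuses the three reviews from the question as plain strings (the variable names `docs`, `vectorizer`, and `features` are only illustrative) and shows that the output is a numeric matrix with one row per review, which is the kind of input LinearSVC expects.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# The three example reviews from the question, used as raw strings
docs = [
    "it is a great product for turning lights on.",
    "plays music and have a good sound.",
    "I love it, lots of fun.",
]

vectorizer = TfidfVectorizer(min_df=1, stop_words="english")
features = vectorizer.fit_transform(docs)  # sparse matrix of TF-IDF weights

print(features.shape)                  # (number of reviews, vocabulary size)
print(features.dtype)                  # a floating-point dtype, not strings
print(sorted(vectorizer.vocabulary_))  # the words that became columns
```

Each column corresponds to one word of the learned vocabulary, so the classifier never sees the original strings.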
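SVM models cannot handle text data directly. You need to extract numerical features from the text first. I recommend reading up on some NLP basics, such as Bag of Words and TF-IDF. In any case, for the example you gave, a minimal working pipeline would be: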
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
# Load dataset
X = training['review']
y = training['label']
# The pipeline turns raw text into TF-IDF features before the SVM sees it
linear_svc = make_pipeline(TfidfVectorizer(), LinearSVC())
# Calibrated wrapper that adds probability estimates to the pipeline
calibrated_svc = CalibratedClassifierCV(linear_svc,
                                        method='sigmoid',
                                        cv=3)
calibrated_svc.fit(X, y)
# predict
prediction_data = predict_data['review']
predicted_probs = calibrated_svc.predict_proba(prediction_data)
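You may also want to clean the text up a little by removing special characters, lowercasing, stemming, and so on. Take a look at the spaCy text-processing library.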
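As a rough idea of what such cleaning might look like, here is a minimal sketch using only the standard library instead of spaCy. The helper name `clean_text` is hypothetical, and it covers only lowercasing and removal of special characters; stemming is left out.

```python
import re

def clean_text(text: str) -> str:
    """Lowercase, replace non-alphanumeric characters with spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("I LOVE it, lots of fun!!!"))  # i love it lots of fun
```

A function like this could be passed to TfidfVectorizer through its `preprocessor` parameter. Note that TfidfVectorizer already lowercases text by default, so the lowercasing step matters mainly when the cleaning is done outside the vectorizer.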