How to test unseen test data with cross validation and predict labels?
1. This CSV contains the data (i.e., text descriptions) along with the category labels:
import pandas as pd

df = pd.read_csv('./output/csv_sanitized_16_.csv', dtype=str)
X = df['description_plus']
y = df['category_id']
2. This CSV contains the unseen data (i.e., text descriptions) whose labels need to be predicted:
df_2 = pd.read_csv('./output/csv_sanitized_2.csv', dtype=str)
X2 = df_2['description_plus']
A cross-validation function that operates on the training data above (item 1):
from sklearn import preprocessing, svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

def cross_val():
    cv = KFold(n_splits=20)
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                                 stop_words='english')
    X_train = vectorizer.fit_transform(X)
    clf = make_pipeline(preprocessing.StandardScaler(with_mean=False), svm.SVC(C=1))
    scores = cross_val_score(clf, X_train, y, cv=cv)
    print(scores)
    print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
cross_val()
I need to know how to pass the unseen data (item 2) into the cross-validation function, and how to predict its labels.
With scores = cross_val_score(clf, X_train, y, cv=cv) you only obtain the model's cross-validation scores. cross_val_score splits the data internally into training and test folds according to the cv parameter, so the values you get are the cross-validated accuracy of the SVC.
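As a small illustration (a self-contained sketch on a toy dataset, not the question's data), each entry in the returned array is the accuracy of one fold, so a KFold with n_splits=20 yields 20 values and no fitted model or predictions:
from sklearn import datasets, svm
from sklearn.model_selection import KFold, cross_val_score

# toy data standing in for the question's TF-IDF features
X_toy, y_toy = datasets.load_iris(return_X_y=True)
scores = cross_val_score(svm.SVC(C=1), X_toy, y_toy, cv=KFold(n_splits=5))
print(len(scores))    # 5 -- one accuracy value per fold
print(scores.mean())  # the cross-validated accuracy estimate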
To get a score on the unseen data, you can first fit the model, e.g.
clf = make_pipeline(preprocessing.StandardScaler(with_mean=False), svm.SVC(C=1))
clf.fit(X_train, y) # the model is trained now
and then call clf.score(X_unseen, y_unseen), where y_unseen are the true labels of the unseen data. This last call returns the model's accuracy on the unseen data.
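For the text data in the question, the same idea looks roughly like the sketch below (reusing the question's X, y and X2; names such as X_unseen and predicted_labels are just illustrative). The key point is that the unseen descriptions must be transformed with the vectorizer already fitted on the training descriptions (transform, not fit_transform):
from sklearn import preprocessing, svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
X_train = vectorizer.fit_transform(X)     # learn the vocabulary from the training descriptions
X_unseen = vectorizer.transform(X2)       # reuse that vocabulary for the unseen descriptions

clf = make_pipeline(preprocessing.StandardScaler(with_mean=False), svm.SVC(C=1))
clf.fit(X_train, y)                       # the model is trained now

predicted_labels = clf.predict(X_unseen)  # predicted category_id for each unseen description
# clf.score(X_unseen, y_unseen) would additionally require the true labels y_unseen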
EDIT: The best way to do what you want is to use GridSearch to first find the best model using the training data, and then evaluate that best model on the unseen (test) data:
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
# load some data
iris = datasets.load_iris()
X, y = iris.data, iris.target
# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
# hyperparameter tuning of the SVC model
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svc = svm.SVC()
# fit the GridSearch using the TRAINING data
grid_searcher = GridSearchCV(svc, parameters)
grid_searcher.fit(X_train, y_train)
#recover the best estimator (best parameters for the SVC, based on the GridSearch)
best_SVC_model = grid_searcher.best_estimator_
# Now, check how this best model behaves on the test set
cv_scores_on_unseen = cross_val_score(best_SVC_model, X_test, y_test, cv=5)
print(cv_scores_on_unseen.mean())
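Carrying this pattern over to the question's text data could look roughly like the following sketch (assumptions: the question's X, y and X2 are in scope; the Pipeline step names 'tfidf', 'scale' and 'svc', and therefore the svc__* parameter keys, are choices made here, not anything from the original post). Wrapping the TfidfVectorizer inside the pipeline means it is re-fitted within each grid-search fold, which avoids leaking vocabulary from the validation folds:
from sklearn import preprocessing, svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('tfidf', TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')),
    ('scale', preprocessing.StandardScaler(with_mean=False)),
    ('svc', svm.SVC()),
])
param_grid = {'svc__kernel': ('linear', 'rbf'), 'svc__C': [1, 10]}

grid_searcher = GridSearchCV(pipe, param_grid, cv=5)
grid_searcher.fit(X, y)                    # tune on the labelled training descriptions (item 1)

best_model = grid_searcher.best_estimator_
predicted_labels = best_model.predict(X2)  # labels for the unseen descriptions (item 2)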