How to compute k-fold cross-validation and the standard deviation of performance for each classifier?
I need (per the prompt) to "compute the n-fold cross validation as well as mean and standard deviation of the performance measure on the n folds" for each of 3 algorithms.
My original data frame is structured as follows, with 16 types repeating:

    target  type  post
    1       intj  "hello world shdjd"
    2       entp  "hello world fddf"
    16      estj  "hello world dsd"
    4       esfp  "hello world sfs"
    1       intj  "hello world ddfd"
I have already trained and computed the accuracy of Naive Bayes, SVM, and Logistic Regression like this:
    text_clf3 = Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', LogisticRegression(multi_class='multinomial', solver='newton-cg')),
    ])
    text_clf3.fit(result.post, result.target)
    predicted3 = text_clf3.predict(docs_test)
    print("Logistic Regression: ")
    print(np.mean(predicted3 == result.target))
where clf is one of:

    LogisticRegression(multi_class='multinomial', solver='newton-cg')

    SGDClassifier(loss='hinge', penalty='l2',
                  alpha=1e-3, random_state=42,
                  max_iter=5, tol=None)

and

    MultinomialNB(alpha=0.0001)
I can get metrics.classification_report(result.target, predicted3) for each model, but I don't know how to implement cross-validation.

How do I do this?
I can't test this since I don't have the dataset, but the code below will hopefully make the main idea clear. In the code below, all_post stands for all samples combined — per your example, result.post and docs_test — and n is assumed to be 10.
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.linear_model import LogisticRegression, SGDClassifier
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.model_selection import cross_val_score

    models = {'lr': LogisticRegression(multi_class='multinomial', solver='newton-cg'),
              'nb': MultinomialNB(alpha=0.0001),
              'sgd': SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3,
                                   random_state=42, max_iter=5, tol=None)}

    for name, clf in models.items():
        pipe = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('clf', clf)])
        res = cross_val_score(pipe, all_post, all_target, cv=10)  # res is an array of size 10
        print(name, res.mean(), res.std())
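Since I don't have your all_post and all_target, here is a minimal self-contained sketch of the same idea on placeholder data (the posts and labels below are made up, not your dataset) showing what cross_val_score returns — one accuracy score per fold, which you then summarize with mean and standard deviation:

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder stand-ins for result.post / result.target (not real data)
all_post = pd.Series(["great fun happy day"] * 15 + ["awful sad bad day"] * 15)
all_target = pd.Series(["intj"] * 15 + ["entp"] * 15)

pipe = Pipeline([('vect', CountVectorizer()),
                 ('tfidf', TfidfTransformer()),
                 ('clf', LogisticRegression())])

# cv=5 -> res holds 5 fold accuracies; mean/std summarize fold-to-fold variation
res = cross_val_score(pipe, all_post, all_target, cv=5)
print(res.mean(), res.std())
```

Note that cross-validation fits and scores the pipeline on every fold itself, so you pass the full labeled data rather than a separate train/test split.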