cross_val_score is returning a list of nan scores in scikit-learn
I am trying to use cross-validation on an imbalanced multi-label dataset, but scikit-learn's cross_val_score returns a list of nan values when I run the classifier.
Here is the code:
import pandas as pd
import numpy as np
data = pd.DataFrame.from_dict(dict, orient = 'index') # save the given data below in dict variable to run this line
from sklearn.model_selection import train_test_split, cross_val_score  # cross_val_score is used below
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
multilabel = MultiLabelBinarizer()
y = multilabel.fit_transform(data['Tags'])
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
tfidf = TfidfVectorizer(stop_words = stop_words,max_features= 40000, ngram_range = (1,3))
X = tfidf.fit_transform(data['cleaned_title'])
from skmultilearn.model_selection import IterativeStratification
k_fold = IterativeStratification(n_splits=10, order=1)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import jaccard_score
class_weight = {0:1,1:10}
lr = LogisticRegression(class_weight = class_weight, n_jobs = -1)
scores = cross_val_score(lr, X, y, cv=k_fold, scoring = 'f1_micro')
scores
Here is the data (first 10 rows), obtained with data.head(10).to_dict():
{0: {'Tags': ['python', 'list', 'loops', 'for-loop', 'indexing'],
'cleaned_title': 'for loop we use any local variable what if we use any number present in a list ',
'cleaned_text_of_ques': 'in the for loop we use any local variable what if we use any number in a list what will be the output a [ 1 2 3 4 5 6 ] b [ ] for a[ 1 ] in a b append a[ 1 ] print b '},
1: {'Tags': ['python', 'loops', 'tkinter', 'algorithm-animation'],
'cleaned_title': 'contain a mainloop [ duplicate ]',
'cleaned_text_of_ques': 'my code be a bubble sort that i be try to visualise but i be struggle to find a way to make a block of code only be use once i also think that if i could only mainloop a section that would'},
2: {'Tags': ['android',
'android-lifecycle',
'activity-lifecycle',
'onsaveinstancestate'],
'cleaned_title': 'when onrestoreinstancestate be not call ',
'cleaned_text_of_ques': 'docs describe when onrestoreinstancestate be call this method be call after onstart when the activity be be re initialize from a previously save state give here in savedinstancestate '},
3: {'Tags': ['python', 'r', 'bash', 'conda', 'spyder'],
'cleaned_title': 'point conda r to already instal version of r',
'cleaned_text_of_ques': 'my problem have to do with the fact that rstudio and conda be point to different version of r my r and rstudio be instal independent of anaconda and everything be work great in my '},
4: {'Tags': ['android',
'firebase',
'firebase-realtime-database',
'android-recyclerview'],
'cleaned_title': 'how to use a recycleview with several different layout accord to the datum collect in firebase [ close ]',
'cleaned_text_of_ques': 'i have a problem there be day that i do research and test code but nothing work my application will have a window where i will post datum take in firebase use a recycleview with the'},
5: {'Tags': ['html', 'css', 'layout'],
'cleaned_title': 'how to create side by side layout of an image and label ',
'cleaned_text_of_ques': 'i have be try for a while now and can not seem to achive the bellow design exploreitem background color 353258 rgba 31 31 31 1 border 1px solid 4152f1 color '},
6: {'Tags': ['php', 'jquery', 'file'],
'cleaned_title': 'php jquery ajax _ files[ file ] undefined index error',
'cleaned_text_of_ques': 'i have a form that upload image file and it be not work i have try submit and click event the error appear when i have remove the if statement thank in advance for your help '},
7: {'Tags': ['python', 'pandas', 'dataframe'],
'cleaned_title': 'how to update value in pandas dataframe in a for loop ',
'cleaned_text_of_ques': 'i be try to make a data frame that can store variable coeff value after each iteration i be able to plot the graph after each iteration but when i try to insert the value in the data frame'},
8: {'Tags': ['xpath', 'web-scraping', 'scrapy'],
'cleaned_title': 'scrapy how can i handle a random number of element ',
'cleaned_text_of_ques': 'i have a scrapy crawler that i can comfortably acquire the first desire paragraph but sometimes there be a second or third paragraph response xpath f string h2[contains text card ] '},
9: {'Tags': ['bootstrap-4', 'tabs', 'collapse'],
'cleaned_title': 'collapse three column with bootstrap',
'cleaned_text_of_ques': 'i be try to make three tab with cross reference with one tab visible at the time i be use the bootstrap v4 collapse scheme with functionality support by jquery here be the example https '}}
This is what I get in the scores variable from cross_val_score:
array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan])
Each value should be in the range 0 to 1. However, this happens with every model I try.
I think you need to change the model used in the scoring line:
scores = cross_val_score(lr, X, y, cv=k_fold, scoring = 'f1_micro')
scores
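You have a multi-label dataset, which means your y variable has more than one column after the transformation, and logistic regression does not work on that: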
lr.fit(X,y)
ValueError: y should be a 1d array, got an array of shape (10, 32) instead.
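cross_val_score does not show this error by default: its error_score parameter defaults to np.nan, so a fold whose fit fails is scored as nan. A minimal sketch of surfacing the exception with error_score='raise' follows. It uses synthetic data from make_multilabel_classification as a stand-in for the TF-IDF features and binarized tags, and the *_demo names are illustrative, not from the question:

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic multi-label data (assumption: stands in for X and y in the question).
X_demo, y_demo = make_multilabel_classification(
    n_samples=60, n_features=20, n_classes=4, random_state=0
)

lr_demo = LogisticRegression()
kf_demo = KFold(n_splits=3)

# error_score="raise" makes cross_val_score re-raise the exception from fit
# instead of recording the fold's score as nan.
raised = None
try:
    cross_val_score(
        lr_demo, X_demo, y_demo,
        cv=kf_demo, scoring="f1_micro", error_score="raise",
    )
except ValueError as exc:
    raised = exc

print(type(raised).__name__, raised)
```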
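That is why you get nan. You need to choose a classifier that supports multi-label targets; see the sklearn help page for the options. Also, I am not sure whether IterativeStratification works on multi-label data, so if you use KFold it works: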
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold, cross_val_score
kf = KFold(n_splits=5)
clf = DecisionTreeClassifier()
scores = cross_val_score(clf, X, y, cv=kf, scoring = 'f1_micro')
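If you would rather keep logistic regression, one possible option is to wrap it in OneVsRestClassifier (already imported in the question but never used), which fits one binary model per label column. This is a sketch under assumptions rather than a tuned solution: the data is synthetic (make_multilabel_classification), the names ending in _demo are illustrative, and max_iter=1000 is only there to avoid convergence warnings.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.multiclass import OneVsRestClassifier

# Synthetic multi-label data (assumption: stands in for X and y in the question).
X_demo, y_demo = make_multilabel_classification(
    n_samples=200, n_features=20, n_classes=5, random_state=0
)

# One binary logistic regression per label column; within each binary
# problem the positive class (1) is weighted 10x, as in the question.
ovr_demo = OneVsRestClassifier(
    LogisticRegression(class_weight={0: 1, 1: 10}, max_iter=1000)
)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=0)
ovr_scores = cross_val_score(ovr_demo, X_demo, y_demo, cv=kf_demo, scoring="f1_micro")

print(ovr_scores)
print("any nan:", np.isnan(ovr_scores).any())
```

With the real data, X and y from the question would replace X_demo and y_demo. Very rare tags may be missing from some training folds, so it can help to check the label counts first.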