My data has 14 attributes and 303 observations, but KNN with k greater than 1 raises an error
I am getting this error:
ValueError: Expected n_neighbors <= n_samples, but n_samples = 1, n_neighbors = 11
The data I am using has 14 attributes and 303 observations. I want the number of neighbors to be 11 (or any value greater than 1), but I get this error every time.
Here is my code:
import pandas as pd
from sklearn.model_selection import learning_curve
from sklearn.neighbors import KNeighborsClassifier

header_names = ['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal','num']
dataset = pd.read_csv('E:/HCU proj doc/EHR dataset/cleveland_cleaned_data.csv', names=header_names)

training_sizes = [1,25,50,75,100,150,200]
features = ['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal']
target = 'num'

train_size, train_scores, validation_scores = learning_curve(
    estimator=KNeighborsClassifier(n_neighbors=11),
    X=dataset[features], y=dataset[target],
    train_sizes=training_sizes, cv=5, scoring='neg_log_loss')
Here is the traceback of the error:
Traceback (most recent call last):
File "E:\HCU proj doc\heart_disease_scaling_and_learning_curve.py", line 15, in <module>
train_size, train_scores, validation_scores = learning_curve(estimator = KNeighborsClassifier(n_neighbors=11), X=dataset[features], y=dataset[target], train_sizes=training_sizes, cv=5, scoring='neg_log_loss')
File "C:\Users\Sharanya\AppData\Local\Programs\Python\Python36-32\lib\site-packages\sklearn\model_selection\_validation.py", line 1128, in learning_curve
for train, test in train_test_proportions)
File "C:\Users\Sharanya\AppData\Local\Programs\Python\Python36-32\lib\site-packages\sklearn\externals\joblib\parallel.py", line 779, in __call__
while self.dispatch_one_batch(iterator):
File "C:\Users\Sharanya\AppData\Local\Programs\Python\Python36-32\lib\site-packages\sklearn\externals\joblib\parallel.py", line 625, in dispatch_one_batch
self._dispatch(tasks)
File "C:\Users\Sharanya\AppData\Local\Programs\Python\Python36-32\lib\site-packages\sklearn\externals\joblib\parallel.py", line 588, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "C:\Users\Sharanya\AppData\Local\Programs\Python\Python36-32\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py", line 111, in apply_async
result = ImmediateResult(func)
File "C:\Users\Sharanya\AppData\Local\Programs\Python\Python36-32\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py", line 332, in __init__
self.results = batch()
File "C:\Users\Sharanya\AppData\Local\Programs\Python\Python36-32\lib\site-packages\sklearn\externals\joblib\parallel.py", line 131, in __call__
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "C:\Users\Sharanya\AppData\Local\Programs\Python\Python36-32\lib\site-packages\sklearn\externals\joblib\parallel.py", line 131, in <listcomp>
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "C:\Users\Sharanya\AppData\Local\Programs\Python\Python36-32\lib\site-packages\sklearn\model_selection\_validation.py", line 488, in _fit_and_score
test_scores = _score(estimator, X_test, y_test, scorer, is_multimetric)
File "C:\Users\Sharanya\AppData\Local\Programs\Python\Python36-32\lib\site-packages\sklearn\model_selection\_validation.py", line 528, in _score
score = scorer(estimator, X_test, y_test)
File "C:\Users\Sharanya\AppData\Local\Programs\Python\Python36-32\lib\site-packages\sklearn\metrics\scorer.py", line 138, in __call__
y_pred = clf.predict_proba(X)
File "C:\Users\Sharanya\AppData\Local\Programs\Python\Python36-32\lib\site-packages\sklearn\neighbors\classification.py", line 190, in predict_proba
neigh_dist, neigh_ind = self.kneighbors(X)
File "C:\Users\Sharanya\AppData\Local\Programs\Python\Python36-32\lib\site-packages\sklearn\neighbors\base.py", line 347, in kneighbors
(train_size, n_neighbors)
ValueError: Expected n_neighbors <= n_samples, but n_samples = 1, n_neighbors = 11
What is wrong with the code, and what should I do to get rid of this error?
I suspect the problem has to do with the way you define the target vector. Try replacing this:
target = 'num'
with this:
target = ['num']
Hope this helps.
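For what it's worth, the only thing this change alters is the shape of what gets passed as y: dataset['num'] is a 1-D Series, while dataset[['num']] is a single-column 2-D DataFrame. A toy illustration (hypothetical data, not the Cleveland file):

```python
import pandas as pd

df = pd.DataFrame({'num': [0, 1, 0]})

print(df['num'].shape)     # (3,)  -> 1-D Series
print(df[['num']].shape)   # (3, 1) -> 2-D single-column DataFrame
```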
Your task is binary classification, so when you set training_size=1, only a single sample reaches the scoring function (log_loss in this case), and that sample carries just one of the two labels, either 0.0 or 1.0. That is where the error comes from: you need to provide all the labels to the metric function so that it can compute the loss. (Note also that with a training set of a single sample, a KNN classifier cannot find 11 neighbors at all, since n_neighbors must not exceed the number of training samples, which is exactly what the traceback reports.)
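The log_loss side of this is easy to reproduce directly; a minimal sketch:

```python
from sklearn.metrics import log_loss

y_true = [0.0]           # a single observation: only one class is present
y_prob = [[0.9, 0.1]]    # predicted probabilities for classes 0.0 and 1.0

try:
    log_loss(y_true, y_prob)  # cannot infer that the task has two classes
except ValueError as err:
    print('log_loss failed:', err)

# Passing the full label set explicitly resolves the ambiguity
print(log_loss(y_true, y_prob, labels=[0.0, 1.0]))
```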
There are several things you can do to fix this:
1) Don't pass training_sizes to learning_curve, as @desertnaut said, and let it use the default. In that case the training data is split into 5 equally spaced incremental parts, which (in most cases) will contain all the labels present in the training set, and log_loss will identify them automatically when computing the score.
2) Change the training_sizes values to something more meaningful, for example by simply dropping the 1:
training_sizes = [25,50,75,100,150,200]
This works for your data.
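As a sanity check that option 2 removes the error, here is a self-contained sketch of the same learning_curve call on synthetic data (make_classification standing in for the Cleveland file is an assumption, not your actual dataset):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the Cleveland data: 303 rows, 13 features, binary target
X, y = make_classification(n_samples=303, n_features=13, random_state=0)

training_sizes = [50, 100, 150, 200]  # every size comfortably exceeds n_neighbors
sizes, train_scores, validation_scores = learning_curve(
    KNeighborsClassifier(n_neighbors=11),
    X, y,
    train_sizes=training_sizes,
    cv=5,
    scoring='neg_log_loss')

print(sizes)               # absolute training-set sizes actually used
print(train_scores.shape)  # (n_sizes, n_cv_folds)
```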
3) Change the scoring parameter so that all labels are passed to log_loss explicitly. That way, even if you specify 1 in training_sizes, the log_loss method knows the data has 2 labels and computes the loss accordingly.
from sklearn.metrics import log_loss, make_scorer

# This will calculate the 'neg_log_loss' as you wanted, just with one extra param
scorer = make_scorer(log_loss, greater_is_better=False,
                     needs_proba=True,
                     labels=[0.0, 1.0])  # <== This is what you need
Then, do this:
....
....
train_size, train_scores, validation_scores = learning_curve(
    KNeighborsClassifier(n_neighbors=1),
    X=dataset[features],
    y=dataset[target],
    train_sizes=training_sizes,
    cv=5,
    scoring=scorer)  # <== Add that here