ValueError: Number of labels=1993 does not match number of samples=1994
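Hi everyone, I'm new to machine learning and working on an interesting crime-prediction project. I fixed an earlier error, but unfortunately the code block below now raises a new one. I'm using the dataset available from the UCI ML Repo. I've looked through similar posts but couldn't find a relevant solution.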
import pandas as pd
import numpy as np
from sklearn import tree
from sklearn.model_selection import cross_val_score
df_d=pd.read_csv('communities-crime-full.csv')
df
df['highCrime'] = np.where(df['ViolentCrimesPerPop']>0.1, 1, 0)
Y = df['highCrime']
# print('total len is ',len(Y))
initial=pd.read_csv('communities-crime-full.csv')
initial = initial.drop('communityname', 1)
initial = initial.drop('ViolentCrimesPerPop', 1)
initial = initial.drop('fold', 1)
initial = initial.drop('state', 1)
initial = initial.drop('community', 1)
initial = initial.drop('county', 1)
skipinitialspace = True
feature_name=list(initial)
#initial=initial.convert_objects(convert_numeric=True)
initial = initial.apply(pd.to_numeric, errors='coerce')
New_data=initial.fillna(initial.mean())
# print('before...')
# print(initial)
# print('after...')
# print(New_data)
clf = tree.DecisionTreeClassifier(max_depth=3)
# clf = tree.DecisionTreeClassifier()
clf = clf.fit(New_data, Y)
clf
fold=df['fold']
scores = cross_val_score(clf, New_data, Y,fold,'accuracy',10)
print('cross_val_accuracy is ',scores)
print('cross_val_accuracy_avg is ',np.array(scores).mean())
scores = cross_val_score(clf, New_data, Y,fold,'precision',10)
print('cross_val_precision is ',scores)
print('cross_val_precision_avg is ',np.array(scores).mean())
scores = cross_val_score(clf, New_data, Y,fold,'recall',10)
print('cross_val_recall is ',scores)
print('cross_val_recall_avg is ',np.array(scores).mean())
Error:
ValueError Traceback (most recent call last)
<ipython-input-15-444381be2864> in <module>()
25 clf = tree.DecisionTreeClassifier(max_depth=3)
26 # clf = tree.DecisionTreeClassifier()
---> 27 clf = clf.fit(New_data, Y)
28 clf
29 fold=df['fold']
/root/.local/lib/python3.7/site-packages/sklearn/tree/_classes.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
281 if len(y) != n_samples:
282 raise ValueError("Number of labels=%d does not match "
--> 283 "number of samples=%d" % (len(y), n_samples))
284 if not 0 <= self.min_weight_fraction_leaf <= 0.5:
285 raise ValueError("min_weight_fraction_leaf must in [0, 0.5]")
ValueError: Number of labels=1993 does not match number of samples=1994
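The error says you have one fewer label than samples: 1993 labels against 1994 samples.
In other words, there is one more input row than there are outputs.
I don't think that is your real problem, though. It looks like you accidentally used data that had been loaded into RAM earlier and has the wrong size: the file is read into df_d, but Y is built from df, so df is whatever an earlier cell left behind.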
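To illustrate how a leftover variable can produce this kind of off-by-one mismatch, here is a minimal sketch. The two small DataFrames, the `population` column, and the helper `split_features_labels` are hypothetical stand-ins invented for illustration; they are not the actual dataset or code from the question. The idea is only that deriving features and labels from the same frame keeps their lengths in step.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a DataFrame left over from an earlier notebook cell:
# it has 3 rows, one fewer than the freshly loaded data below.
stale_df = pd.DataFrame({"ViolentCrimesPerPop": [0.05, 0.20, 0.40]})

# Hypothetical stand-in for the freshly loaded CSV: 4 rows.
fresh_df = pd.DataFrame({
    "population": [0.1, 0.3, 0.2, 0.5],
    "ViolentCrimesPerPop": [0.05, 0.20, 0.40, 0.08],
})


def split_features_labels(frame, target="ViolentCrimesPerPop", threshold=0.1):
    """Derive X and y from the same frame so their lengths cannot drift apart."""
    y = pd.Series(
        np.where(frame[target] > threshold, 1, 0),
        index=frame.index,
        name="highCrime",
    )
    X = frame.drop(columns=[target])
    return X, y


# Labels from the stale frame, features from the fresh one: lengths disagree.
y_stale = np.where(stale_df["ViolentCrimesPerPop"] > 0.1, 1, 0)
X_fresh = fresh_df.drop(columns=["ViolentCrimesPerPop"])
mismatch = len(y_stale) != len(X_fresh)
print("stale labels:", len(y_stale), "fresh samples:", len(X_fresh))

# Both derived from the same frame: lengths always agree.
X, y = split_features_labels(fresh_df)
print("labels:", len(y), "samples:", len(X))
```

Printing `len(Y)` and `len(New_data)` just before `clf.fit` is a cheap way to spot the same problem in the original notebook.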
In your code you have:
df_d=pd.read_csv('communities-crime-full.csv')
It should be:
df=pd.read_csv('communities-crime-full.csv')
With that change, the full code becomes:
import pandas as pd
import numpy as np
from sklearn import tree
from sklearn.model_selection import cross_val_score

# Load into `df` (the question assigned this to `df_d` by mistake)
df = pd.read_csv('communities-crime-full.csv')
df['highCrime'] = np.where(df['ViolentCrimesPerPop'] > 0.1, 1, 0)
Y = df['highCrime']
# print('total len is ', len(Y))

# Build the feature matrix: drop the target and the non-predictive columns
initial = pd.read_csv('communities-crime-full.csv')
initial = initial.drop('communityname', axis=1)
initial = initial.drop('ViolentCrimesPerPop', axis=1)
initial = initial.drop('fold', axis=1)
initial = initial.drop('state', axis=1)
initial = initial.drop('community', axis=1)
initial = initial.drop('county', axis=1)
skipinitialspace = True  # no effect: a plain variable, not a read_csv option
feature_name = list(initial)

# Coerce non-numeric entries to NaN, then fill them with the column mean
initial = initial.apply(pd.to_numeric, errors='coerce')
New_data = initial.fillna(initial.mean())

clf = tree.DecisionTreeClassifier(max_depth=3)
# clf = tree.DecisionTreeClassifier()
clf = clf.fit(New_data, Y)
clf

# Keyword arguments keep the original meaning (groups, scoring, cv)
# and are required by recent scikit-learn releases
fold = df['fold']
scores = cross_val_score(clf, New_data, Y, groups=fold, scoring='accuracy', cv=10)
print('cross_val_accuracy is ', scores)
print('cross_val_accuracy_avg is ', np.array(scores).mean())
scores = cross_val_score(clf, New_data, Y, groups=fold, scoring='precision', cv=10)
print('cross_val_precision is ', scores)
print('cross_val_precision_avg is ', np.array(scores).mean())
scores = cross_val_score(clf, New_data, Y, groups=fold, scoring='recall', cv=10)
print('cross_val_recall is ', scores)
print('cross_val_recall_avg is ', np.array(scores).mean())
This produces:
cross_val_accuracy is [0.81 0.825 0.805 0.8 0.82914573 0.77386935
0.85427136 0.83417085 0.80904523 0.8040201 ]
cross_val_accuracy_avg is 0.8144522613065327
cross_val_precision is [0.90740741 0.86290323 0.84677419 0.84 0.85826772 0.85714286
0.92105263 0.92592593 0.85950413 0.90566038]
cross_val_precision_avg is 0.8784638467535306
cross_val_recall is [0.77777778 0.856 0.84 0.84 0.872 0.768
0.84 0.8 0.832 0.768 ]
cross_val_recall_avg is 0.8193777777777778
So it does look like it's learning something!
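As a side note on the cross-validation calls: with `cv=10` and a classifier, scikit-learn builds its own stratified 10-fold split, so the `fold` column passed as groups does not decide which rows land in which fold. If the intent was to reuse the dataset's own fold assignment, one possible approach is `PredefinedSplit`, and `cross_validate` can compute all three metrics in one pass. The sketch below assumes this intent and runs on synthetic data from `make_classification` with a made-up `fold` array, not on the crime dataset, so its scores say nothing about the real problem.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import PredefinedSplit, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the communities-and-crime CSV).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hypothetical fold column: each row assigned to one of 10 folds, labelled 1..10.
fold = np.arange(len(y)) % 10 + 1

clf = DecisionTreeClassifier(max_depth=3, random_state=0)

# Each distinct value in `fold` becomes the test set of exactly one split.
cv = PredefinedSplit(test_fold=fold)

# One call evaluates all three metrics on the same splits.
results = cross_validate(
    clf, X, y, scoring=["accuracy", "precision", "recall"], cv=cv
)

for metric in ("accuracy", "precision", "recall"):
    scores = results[f"test_{metric}"]
    print(f"cross_val_{metric} is", scores)
    print(f"cross_val_{metric}_avg is", scores.mean())
```

In the original notebook the equivalent would be passing `df['fold']` as `test_fold`, assuming that column really holds the intended fold numbers.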