分类指标无法处理二元和连续目标的混合 -Python 随机森林
classification metrics cannot handle a mix of binary and continuous targets -Python random forests
我的输入数据文件格式如下:
黄金,callersAtLeast1T,CalleesAtLeast1T,...
T,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0
N,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0
N,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0
N,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0
我正在尝试根据剩余列的值预测第一列(金色),这是我使用的代码:
import pandas as pd
import numpy as np
dataset = pd.read_csv( 'data1extended.txt', sep= ',')
#convert T into 1 and N into 0
dataset['gold'] = dataset['gold'].astype('category').cat.codes
print(dataset.head())
row_count, column_count = dataset.shape
X = dataset.iloc[:, 1:column_count].values
y = dataset.iloc[:, 0].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=20, random_state=0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test, y_pred))
我的代码的最后 3 行导致错误,如何解决?
此行导致错误:
打印(confusion_matrix(y_test,y_pred))
我打印了 y_test 和 y_pred,这是我得到的:
y_test 是:[0 0 0 ... 0 0 0]
y_pred 是:[0.0007123 0.00402548 0.00402548 ... 0.00402548 0.02651928 0.00816086]
您使用的是 RandomForestRegressor,它输出连续值输出,即实数,而混淆矩阵需要类别值输出,即离散数输出 0、1、2 等等。
由于您要预测 类,即 1 或 0,您可以做两件事:
1.) 使用 RandomForestClassifier 而不是 RandomForestRegressor,后者将输出 0 或 1,您可以使用它来获取指标。 (推荐)
2.) 如果你只想要真正有价值的输出,你可以设置一个阈值,即
y_pred = (y_pred < threshold).astype(int)
如果数字小于阈值 else 1,这会将您的输出实数转换为 1,并将其用于获取指标。
我的输入数据文件格式如下:
黄金,callersAtLeast1T,CalleesAtLeast1T,...
T,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0
N,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0
N,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0
N,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0
我正在尝试根据剩余列的值预测第一列(金色),这是我使用的代码:
import pandas as pd
import numpy as np
dataset = pd.read_csv( 'data1extended.txt', sep= ',')
#convert T into 1 and N into 0
dataset['gold'] = dataset['gold'].astype('category').cat.codes
print(dataset.head())
row_count, column_count = dataset.shape
X = dataset.iloc[:, 1:column_count].values
y = dataset.iloc[:, 0].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=20, random_state=0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test, y_pred))
我的代码的最后 3 行导致错误,如何解决?
此行导致错误:
打印(confusion_matrix(y_test,y_pred))
我打印了 y_test 和 y_pred,这是我得到的:
y_test 是:[0 0 0 ... 0 0 0]
y_pred 是:[0.0007123 0.00402548 0.00402548 ... 0.00402548 0.02651928 0.00816086]
您使用的是 RandomForestRegressor,它输出连续值输出,即实数,而混淆矩阵需要类别值输出,即离散数输出 0、1、2 等等。
由于您要预测 类,即 1 或 0,您可以做两件事:
1.) 使用 RandomForestClassifier 而不是 RandomForestRegressor,后者将输出 0 或 1,您可以使用它来获取指标。 (推荐)
2.) 如果你只想要真正有价值的输出,你可以设置一个阈值,即
y_pred = (y_pred < threshold).astype(int)
如果数字小于阈值 else 1,这会将您的输出实数转换为 1,并将其用于获取指标。