How can I make my random forest regression results more accurate?
Problem: I get an R² of about 0.64 and want to improve it further. I don't know what is wrong with these results. I have already removed outliers, converted strings to numeric values, and normalized the data. Can you tell me what is wrong with my output? If I haven't phrased the question correctly, please ask me anything; this is my first post on Stack Overflow.
y.value_counts()
3.3 215
3.0 185
2.7 154
3.7 134
2.3 96
4.0 54
2.0 31
1.7 21
1.3 20
This is the histogram of my target variable. I'm not a regression specialist and really need your help.
Removing collinearity from the inputs:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# data = z_scores(df)
data = df
correlation = data.corr()

# Keep the k features most correlated with the target column
k = 22
target = 'Please enter your Subjects GPA which you have studied? (CS) [Introduction to ICT]'
cols = correlation.nlargest(k, target)[target].index

# Correlation heatmap of the selected features
cm = np.corrcoef(data[cols].values.T)
f, ax = plt.subplots(figsize=(15, 15))
sns.heatmap(cm, vmax=.8, linewidths=0.01, square=True, annot=True, cmap='viridis',
            linecolor="white", xticklabels=cols.values, annot_kws={'size': 12},
            yticklabels=cols.values)

# Drop the target itself and one highly correlated column from the feature set
cols = pd.DataFrame(cols)
cols = cols.set_axis(["Selected Features"], axis=1)
cols = cols[cols['Selected Features'] != target]
cols = cols[cols['Selected Features'] != 'Your Fsc/Ics marks percentage?']
X = df[cols['Selected Features'].tolist()]
X
Then I applied random forest regression:
import math
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

regressor = RandomForestRegressor(n_estimators=10, random_state=0)
model = regressor.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("MAE Score: ", mean_absolute_error(y_test, y_pred))
print("MSE Score: ", mean_squared_error(y_test, y_pred))
print("RMSE Score: ", math.sqrt(mean_squared_error(y_test, y_pred)))
print("R2 score : %.2f" % r2_score(y_test, y_pred))
And got these results:
MAE Score: 0.252967032967033
MSE Score: 0.13469450549450546
RMSE Score: 0.36700750059706605
R2 score : 0.64
To get better results you need to do hyper-parameter tuning. Try focusing on these:
-
n_estimators = number of trees in the forest
max_features = max number of features considered for splitting a node
max_depth = max number of levels in each decision tree
min_samples_split = min number of data points placed in a node before the node is split
min_samples_leaf = min number of data points allowed in a leaf node
bootstrap = method for sampling data points (with or without replacement)
-
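As a rough sketch of how those parameters could be tuned, here is a randomized search on synthetic data (the grid values and `make_regression` data are illustrative assumptions, not tuned for the dataset in the question):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the question's X and y
X, y = make_regression(n_samples=300, n_features=10, noise=0.5, random_state=0)

# Illustrative search space over the parameters listed above
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_features": ["sqrt", 1.0],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=10,        # number of random combinations to try
    cv=3,             # 3-fold cross-validation per combination
    scoring="r2",
    random_state=0,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_)
print("best CV R2: %.2f" % search.best_score_)
```

A randomized search is usually a cheaper first pass than an exhaustive grid; once it narrows the promising region, a finer grid search around the best values can follow.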
Parameters currently in use (random forest regressor):
{'bootstrap': True,
'criterion': 'mse',
'max_depth': None,
'max_features': 'auto',
'max_leaf_nodes': None,
'min_impurity_decrease': 0.0,
'min_impurity_split': None,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'n_estimators': 10,
'n_jobs': 1,
'oob_score': False,
'random_state': 42,
'verbose': 0,
'warm_start': False}
Use k-fold cross-validation.
Use grid search with cross-validation (GridSearchCV).
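A minimal sketch of combining the two suggestions, again on synthetic data with an illustrative parameter grid:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in for the question's X and y
X, y = make_regression(n_samples=200, n_features=8, noise=0.3, random_state=42)

# Explicit k-fold splitter so every candidate is scored on the same folds
cv = KFold(n_splits=5, shuffle=True, random_state=42)
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 10],
}

grid = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid=param_grid,
    cv=cv,
    scoring="r2",
    n_jobs=-1,
)
grid.fit(X, y)
print(grid.best_params_)
print("best CV R2: %.2f" % grid.best_score_)
```

Scoring with cross-validation also gives a more stable estimate than the single 10% hold-out split in the question, where the test set is small enough for the R² to swing noticeably between random states.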