Random Forest Regression Accuracy different for Training set and Test set
I am new to machine learning and Python. I am trying to build a Random Forest regression model on a dataset from the UCI repository. This is my first machine learning model, and my approach may well be completely wrong.
The dataset is available here - https://archive.ics.uci.edu/ml/datasets/abalone
Below is the complete working code that I have written. I am using Python 3.6.4 on Windows 7 x64 (please forgive the lengthy code).
import tkinter as tk # Required for enabling GUI options
from tkinter import messagebox # Required for pop-up window
from tkinter import filedialog # Required for getting full path of file
import pandas as pd # Required for data handling
from sklearn.model_selection import train_test_split # Required for splitting data into training and test set
from sklearn.ensemble import RandomForestRegressor # Required to build random forest
#------------------------------------------------------------------------------------------------------------------------#
# Create an instance of tkinter and hide the window
root = tk.Tk() # Create an instance of tkinter
root.withdraw() # Hides root window
#root.lift() # Required for pop-up window management
root.attributes("-topmost", True) # To make pop-up window stay on top of all other windows
#------------------------------------------------------------------------------------------------------------------------#
# This block of code reads input file using tkinter GUI options
print("Reading input file...")
# Pop up window to ask user the input file
File_Checker = messagebox.askokcancel("Random Forest Regression Prompt",
                                      "At The Prompt, Enter 'Abalone_Data.csv' File.")
# Kill the execution if user selects "Cancel" in the above pop-up window
if (File_Checker == False):
    quit()
else:
    del(File_Checker)
file_loop = 0
while (file_loop == 0):
    # Get path of base file
    file_path = filedialog.askopenfilename(initialdir = "/",
                                           title = "File Selection Prompt",
                                           filetypes = (("CSV Files", "*.csv"), ))
    # Condition to check if user selected a file or not
    if (len(file_path) < 1):
        # Pop-up window to warn user that no file was selected
        result = messagebox.askretrycancel("File Selection Prompt Error",
                                           "No file has been selected. \nWhat do you want to do?")
        # Condition to repeat the loop or quit program execution
        if (result == True):
            continue
        else:
            quit()
    # Get file name
    file_name = file_path.split("/") # Splits the path with "/" as the delimiter and returns a list
    file_name = file_name[-1]        # Extracts the last element of the list
    # Condition to check if correct file was selected or not
    if (file_name != "Abalone_Data.csv"):
        result = messagebox.askretrycancel("File Selection Prompt Error",
                                           "Incorrect file selected. \nWhat do you want to do?")
        # Condition to repeat the loop or quit program execution
        if (result == True):
            continue
        else:
            quit()
    # Read the base file
    input_file = pd.read_csv(file_path,
                             sep = ',',
                             encoding = 'utf-8',
                             low_memory = True)
    break
# Delete unwanted variables
del(file_loop, file_name)
#------------------------------------------------------------------------------------------------------------------------#
print("Preparing dependent and independent variables...")
# Create Separate dataframe consisting of only dependent variable
y = pd.DataFrame(input_file['Rings'])
# Create Separate dataframe consisting of only independent variable
X = input_file.drop(columns = ['Rings'], inplace = False, axis = 1)
#------------------------------------------------------------------------------------------------------------------------#
print("Handling Dummy Variable Trap...")
# Create a new dataframe to handle categorical data
# This method splits the dategorical data column into separate columns
# This is to ensure we get rid of the dummy variable trap
dummy_Sex = pd.get_dummies(X['Sex'], prefix = 'Sex', prefix_sep = '_', drop_first = True)
# Remove the speciic columns from the dataframe
# These are the categorical data columns which split into separae columns in the previous step
X.drop(columns = ['Sex'], inplace = True, axis = 1)
# Merge the new columns to the original dataframe
X = pd.concat([X, dummy_sex], axis = 1)
#------------------------------------------------------------------------------------------------------------------------#
y = y.values
X = X.values
#------------------------------------------------------------------------------------------------------------------------#
print("Splitting datasets to training and test sets...")
# Splitting the data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
#------------------------------------------------------------------------------------------------------------------------#
print("Fitting Random Forest regression on training set")
# Fitting the regression model to the dataset
regressor = RandomForestRegressor(n_estimators = 100, random_state = 50)
regressor.fit(X_train, y_train.ravel()) # Using ravel() to avoid getting 'DataConversionWarning' warning message
#------------------------------------------------------------------------------------------------------------------------#
print("Predicting Values")
# Predicting a new result with regression
y_pred = regressor.predict(X_test)
# Enter values for new prediction as a Dictionary
test_values = {'Sex_I' : 0,
               'Sex_M' : 0,
               'Length' : 0.5,
               'Diameter' : 0.35,
               'Height' : 0.8,
               'Whole_Weight' : 0.223,
               'Shucked_Weight' : 0.09,
               'Viscera_Weight' : 0.05,
               'Shell_Weight' : 0.07}
# Convert dictionary into dataframe
test_values = pd.DataFrame(test_values, index = [0])
# Rearranging columns to match the order of the training data
test_values = test_values[['Length', 'Diameter', 'Height', 'Whole_Weight', 'Shucked_Weight',
                           'Viscera_Weight', 'Shell_Weight', 'Sex_I', 'Sex_M']]
# Applying feature scaling
#test_values = sc_X.transform(test_values)
# Predicting values of new data
new_pred = regressor.predict(test_values)
#------------------------------------------------------------------------------------------------------------------------#
"""
print("Building Confusion Matrix...")
# Making the confusion matrix
cm = confusion_matrix(y_test, y_pred)
"""
#------------------------------------------------------------------------------------------------------------------------#
print("\n")
print("Getting Model Accuracy...")
# Get regression details
#print("Estimated Coefficient = ", regressor.coef_)
#print("Estimated Intercept = ", regressor.intercept_)
print("Training Accuracy = ", regressor.score(X_train, y_train))
print("Test Accuracy = ", regressor.score(X_test, y_test))
print("\n")
print("Printing predicted result...")
print("Result_of_Treatment = ", new_pred)
When I look at the model accuracy, I get the following output.
Getting Model Accuracy...
Training Accuracy = 0.9359702279804791
Test Accuracy = 0.5695080680053354
Here are my questions.
1) Why are the Training Accuracy and Test Accuracy so far apart?
2) How do I know whether this model is over/under fitted?
3) Is Random Forest regression the right model to use? If not, how do I determine the right model for this use-case?
4) How can I build a confusion matrix using the variables I have created?
5) How do I validate the performance of the model?
I am looking for your guidance so that I can learn from my mistakes and improve my modelling skills.
With Trees and Ensembles, you have to pay attention to some settings. In your case, the difference comes from "overfitting": your model has learned "too much" from your training data and is not able to generalize to other data.
One important thing is to limit the depth of the trees. Each tree has a branching factor of 2, which means that at depth d you will have 2^d branches.
Let's imagine you have 1000 training values. If you don't limit depth (or/and min_samples_leaf), you can learn your complete dataset with a depth of 10 (because 2^10 = 1024 > N_training).
What you can do is compare the training accuracy and the test accuracy over a range of depths (say from 3 to log2(n)). If the depth is too low, both scores will be low, because you need more branches to learn the data properly; they rise to a peak, after which the training score keeps climbing while the test score drops. The curve should look like the classic model-complexity plot, where model complexity is your depth. A sketch of that comparison follows.
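A minimal sketch of that comparison, assuming X and y are the prepared arrays from the question's code (the depth range and hyper-parameters are illustrative):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Same split as in the question
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Compare training and test R^2 as the maximum tree depth grows
for depth in range(3, int(np.log2(len(X_train))) + 1):
    model = RandomForestRegressor(n_estimators = 100, max_depth = depth, random_state = 50)
    model.fit(X_train, y_train.ravel())
    print("depth = %2d | train R^2 = %.3f | test R^2 = %.3f"
          % (depth, model.score(X_train, y_train), model.score(X_test, y_test)))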
You can also play with min_samples_split and/or min_samples_leaf, which let a branch split only when it contains more than a given number of samples. As a result, this also limits depth and allows branches of the same tree to stop at different depths. As explained before, you can experiment with the values to find the best ones (with a grid search); a sketch follows.
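A sketch of such a grid search, again assuming X_train and y_train from the question's code (the parameter values are illustrative, not tuned recommendations):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {'max_depth'        : [4, 6, 8, 10],   # illustrative values
              'min_samples_split': [2, 5, 10],
              'min_samples_leaf' : [1, 2, 5]}
search = GridSearchCV(RandomForestRegressor(n_estimators = 100, random_state = 50),
                      param_grid, cv = 5, scoring = 'neg_mean_squared_error')
search.fit(X_train, y_train.ravel())
print("Best parameters:", search.best_params_)
print("Best CV MSE    :", -search.best_score_)  # scoring is negated MSE, so flip the sign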
Hope this helps.
Before trying to answer your points, a comment: I see you are using a regressor with accuracy as a metric. But accuracy is a metric used in classification problems; in regression models you usually use other metrics, such as the Mean Squared Error (MSE). Note that in scikit-learn the score method of RandomForestRegressor actually returns the R² coefficient, not classification accuracy. See here.
If you simply switch to a more suitable metric, maybe you will find that your model is not so bad after all.
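For example, a minimal sketch that reuses y_test and y_pred from your code:

import numpy as np
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y_test.ravel(), y_pred)
print("Test MSE  =", mse)
print("Test RMSE =", np.sqrt(mse))  # same units as 'Rings', easier to interpret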
I will answer your questions anyway.
Why are the Training Accuracy and Test Accuracy so far apart?
This means that you overfitted your training samples: your model is very strong at predicting the data in the training dataset, but unable to generalize. It is like training a model on a set of cat pictures which comes to believe that only those pictures are cats, and that every other picture of every other cat is not. In fact, your score on the test set is ~0.57, far below the ~0.94 you get on the training set.
How do I know if this model is over/under fitted?
Precisely from the difference in score between the two sets. The closer they are to each other, the better the model generalizes. You already know what overfitting looks like. Underfitting is usually recognizable because both sets score low.
Is Random Forest regression the right model? If not, how do I determine the right model for this use-case?
There is no single right model. Random Forests, and in general all tree-based models (LightGBM, XGBoost), are the Swiss army knife of machine learning when dealing with structured data, because of their simplicity and reliability. Models based on deep learning perform better in theory, but are much more complex to set up.
How can I build a confusion matrix using the variables I have created?
Confusion matrices can be created when you build a classification model, and are constructed from the model's output.
You are using a regressor, so it does not make much sense.
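That said, if you want a confusion-matrix-like view anyway, one workaround (a sketch, not standard practice) is to round the predicted ring counts to integer "classes" first:

import numpy as np
from sklearn.metrics import confusion_matrix

# Round predictions to the nearest integer ring count, then compare as labels
y_pred_rounded = np.rint(y_pred).astype(int)
cm = confusion_matrix(y_test.ravel().astype(int), y_pred_rounded)
print(cm)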
How do I validate the performance of the model?
In general, for a reliable validation of performance, you split the data in three parts: you train on one (a.k.a. the training set), tune the model on a second (a.k.a. the validation set, which is what you are calling the test set), and finally, when you are happy with the model and its hyper-parameters, you test it on the third (a.k.a. the test set, not to be confused with the one you are calling the test set). This last one tells you whether your model generalizes well, because while you choose and tune the model you can also overfit the validation set (the one you call the test set), perhaps selecting a set of hyper-parameters that performs well only on that set. A sketch of such a split follows below.
Also, you have to choose a reliable metric, and this depends on the data and on the model. With regressions, the MSE is pretty good.
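A minimal sketch of that three-way split with scikit-learn, assuming X and y from the question's code (the 60/20/20 proportions are just an example):

from sklearn.model_selection import train_test_split

# First split off the final test set (20% of all data)
X_rest, X_test_final, y_rest, y_test_final = train_test_split(X, y, test_size = 0.2, random_state = 0)
# Then split the remainder into training and validation sets (60% / 20% of all data)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size = 0.25, random_state = 0)

# Tune hyper-parameters against (X_val, y_val);
# touch (X_test_final, y_test_final) only once, at the very end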