Why does DecisionTreeClassifier give a 100% result?
I did the following:
- Split the data into test and training sets.
- Made sure the test and training data have no rows in common.
- Upsampled the training data so that it contains equal numbers of "Yes" and "No".
However, I always get a score of 1.0. Why is that?
Here is the complete code:
from sklearn.tree import DecisionTreeClassifier
import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
Reference (used for converting text > numeric):
https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset/discussion/86957
url = "https://raw.githubusercontent.com/furkan-ozbudak/machine-learning/master/input.csv"
# Import data
dataFrame = pd.read_csv(url)
# Drop non-priority features/columns
dataFrame = dataFrame.drop(columns=['Education', 'EmployeeCount', 'NumCompaniesWorked', 'Over18'])
features = [
    'Attrition',
    'BusinessTravel',
    'Department',
    'EducationField',
    'Gender',
    'JobRole',
    'MaritalStatus',
    'OverTime'
]
stringToNumericDict = {
    "Yes": 1, "No": 0, "Y": 1, "N": 0,
    "Non-Travel": 0, "Travel_Frequently": 2, "Travel_Rarely": 3,
    "Research & Development": 2, "Human Resources": 1, "Sales": 3,
    "Life Sciences": 2, "Medical": 4, "Other": 5, "Marketing": 3, "Technical Degree": 6,
    "Male": 2, "Female": 1,
    "Laboratory Technician": 3, "Healthcare Representative": 1, "Manufacturing Director": 5,
    "Sales Executive": 8, "Research Scientist": 7, "Research Director": 6, "Sales Representative": 9,
    "Manager": 4,
    "Married": 2, "Divorced": 1, "Single": 3,
}
# Convert Alphabets > Numeric
for feature in features:
    dataFrame[feature].replace(stringToNumericDict, inplace=True)
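As an aside, this mapping gives nominal categories such as Department and JobRole an arbitrary numeric order. A decision tree can often cope with that, but one-hot encoding is the more conventional route. A minimal sketch (not part of the original code; it would replace the mapping loop and runs on the raw, unmapped columns):

# Sketch: one-hot encode the nominal columns instead of hand-mapping them,
# and binarize only the target. Assumes the raw (unmapped) dataFrame.
nominal_cols = ['BusinessTravel', 'Department', 'EducationField',
                'Gender', 'JobRole', 'MaritalStatus', 'OverTime']
encoded = pd.get_dummies(dataFrame, columns=nominal_cols)
encoded['Attrition'] = encoded['Attrition'].map({'Yes': 1, 'No': 0})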
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
y = dataFrame['Attrition']
Here I split my data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(dataFrame, y, test_size=0.3, random_state=1)
Here I upsample the training data because one class holds the majority. After this step, "Yes" and "No" are each 50%.
from sklearn.utils import resample
df_majority = X_train[X_train['Attrition']==0] # 0 = No
df_minority = X_train[X_train['Attrition']==1]
print("Count of 'No': %d(majority), Count of Yes: %d(minority)" % (len(df_majority), len(df_minority)))
# Upsample minority class
df_minority_upsampled = resample(df_minority,
                                 replace=True,     # sample with replacement
                                 n_samples=869,    # to match majority class
                                 random_state=50)  # reproducible results
# Combine majority class with upsampled minority class
X_train = pd.concat([df_majority, df_minority_upsampled])
# Display new class counts
X_train['Attrition'].value_counts()
# Rebuild y_train because X_train changed
y_train = X_train['Attrition'].values
sns.countplot(X_train['Attrition'])
all_cols = list(X_train.columns)
# Inspect any rows shared between the training and test sets (should be empty)
X_train.merge(X_test.drop_duplicates(subset=all_cols), how='inner')
# Train once
Reference: https://scikit-learn.org/stable/modules/tree.html
Train the model on the training data and test it on the test data:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)
# predict the class of samples
y_predict = clf.predict(X_test)
#clf.score(X_test, y_test)
from sklearn.metrics import confusion_matrix, classification_report, precision_score, f1_score

print(confusion_matrix(y_test, y_predict))
print(accuracy_score(y_test, y_predict) * 100)
print(classification_report(y_test, y_predict))
print(precision_score(y_test, y_predict))
print(f1_score(y_test, y_predict))
If I understand correctly, your target variable is the Attrition attribute from the original data frame. However, I don't see you removing this attribute from the feature sets, i.e. from X_train and X_test.
If you pass the very target you are trying to predict to the classifier as one of its features, it is no surprise that all of your scores come out as 1.0.
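You can see the leak directly in the fitted model. A quick check (a sketch, assuming the clf from your snippet has already been fitted on the leaky X_train): the tree should assign essentially all of its importance to the Attrition column.

# Sketch: inspect the feature importances of the leaky model.
importances = pd.Series(clf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head())
# Expect 'Attrition' to dominate with an importance close to 1.0.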
I think the simplest way to fix this within your snippet is to call .pop() on X_train and X_test before fitting the classifier:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
X_train.pop('Attrition') # <-- remove target variable
X_test.pop('Attrition') # <-- remove target variable
clf = clf.fit(X_train, y_train)
y_predict = clf.predict(X_test)
That should do the trick, since you are no longer passing the expected outcome as an input.
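More generally, a pattern that makes this kind of leak impossible (a minimal sketch under the same data assumptions, not the only way to do it) is to separate the features from the target before splitting:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Drop the target from the feature matrix up front so it can never leak.
X = dataFrame.drop(columns=['Attrition'])
y = dataFrame['Attrition']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

clf = DecisionTreeClassifier(random_state=1)
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))

With the target removed from the inputs, the accuracy should drop from 1.0 to a realistic value.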