Getting 100% Accuracy on my DecisionTree Model
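Here is my code. It always returns 100% accuracy, no matter how large the test size is. I used the train_test_split method, so I don't think there should be any duplicated data. Can someone check my code?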
from sklearn.tree import DecisionTreeClassifier
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
data = pd.read_csv('housing.csv')
prices = data['median_house_value']
features = data.drop(['median_house_value', 'ocean_proximity'], axis = 1)
prices.shape
(20640,)
features.shape
(20640, 8)
X_train, X_test, y_train, y_test = train_test_split(features, prices, test_size=0.2, random_state=42)
X_train = X_train.dropna()
y_train = y_train.dropna()
X_test = X_test.dropna()
y_test = X_test.dropna()
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
y_train.shape
(16512,)
X_train.shape
(16512, 8)
predictions = model.predict(X_test)
score = model.score(y_test, predictions)
score
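Edit: I found several issues, so I have revised my answer. Please copy and paste the code below to make sure nothing is missed.
Issues -
- You are using DecisionTreeClassifier instead of DecisionTreeRegressor to solve a regression problem.
- You drop the NaNs after the train/test split, which throws off the sample counts. Call data.dropna() before splitting.
- The call model.score(y_test, predictions) is wrong. model.score expects (X_test, y_test) and runs predict itself. Since y_test was assigned from X_test.dropna(), the model was scored against its own predictions, which is why the result was always 100%. Use model.score(X_test, y_test), which returns R^2 for a regressor, or pass (y_test, predictions) to a regression metric such as r2_score. accuracy_score is meant for classification labels, not continuous prices.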
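To illustrate the second point, here is a minimal sketch on a made-up toy frame (the column names `rooms` and `value` are hypothetical, not from housing.csv). It only shows how dropping NaNs after the split leaves features and targets with different row counts, while dropping before the split keeps them aligned.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: the feature column has two missing values, the target has none.
data = pd.DataFrame({
    "rooms": [3.0, np.nan, 5.0, 2.0, np.nan, 4.0, 6.0, 1.0, 7.0, 8.0],
    "value": [30.0, 40.0, 50.0, 20.0, 45.0, 40.0, 60.0, 10.0, 70.0, 80.0],
})

# Dropping AFTER the split: rows vanish from X but not from y.
X_train, X_test, y_train, y_test = train_test_split(
    data[["rooms"]], data["value"], test_size=0.2, random_state=42
)
late_X_rows = len(X_train.dropna()) + len(X_test.dropna())
late_y_rows = len(y_train.dropna()) + len(y_test.dropna())
print(late_X_rows, late_y_rows)  # feature rows and target rows no longer match

# Dropping BEFORE the split: every remaining row still has both features and target.
clean = data.dropna()
X_train, X_test, y_train, y_test = train_test_split(
    clean[["rooms"]], clean["value"], test_size=0.2, random_state=42
)
aligned = len(X_train) == len(y_train) and len(X_test) == len(y_test)
print(aligned)
```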
from sklearn.tree import DecisionTreeRegressor  # <---- FIRST ISSUE
import pandas as pd
from sklearn.model_selection import train_test_split
data = pd.read_csv('housing.csv')
data = data.dropna()  # <--- SECOND ISSUE: drop NaNs before splitting
prices = data['median_house_value']
features = data.drop(['median_house_value', 'ocean_proximity'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(features, prices, test_size=0.2, random_state=42)
model = DecisionTreeRegressor()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
# <----- THIRD ISSUE: score(X, y) returns R^2 for a regressor; accuracy_score is for classification
score = model.score(X_test, y_test)
score
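As for why the code in the question always reported 100%, the likely cause is the combination of `y_test = X_test.dropna()` and `model.score(y_test, predictions)`: the model predicts on the test features again and compares the result with its own earlier predictions. The sketch below reproduces that on synthetic data (the data and the name `mistaken_y_test` are made up for illustration).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data with noise, so a perfect score on held-out data is implausible.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=1.0, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
predictions = model.predict(X_test)

# Mirrors the question: the "targets" variable actually holds the test features.
mistaken_y_test = X_test
# score(X, y) calls predict(X) internally, so this compares predictions with themselves.
self_score = model.score(mistaken_y_test, predictions)

# Scoring against the real held-out targets gives the honest R^2.
real_score = model.score(X_test, y_test)

print(self_score)
print(real_score)
```

The first number is a perfect score by construction and says nothing about the model. Only the second one measures how well the tree generalizes.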