ValueError: Found unknown categories [nan] in column 2 during fit
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.tree import DecisionTreeClassifier
path = r"C:\Users\thund\Downloads\Boat.csv"
data = pd.read_csv(path) # pip install xlrd
print(data.shape)
print(data.columns)
print(data.isnull().sum())
print (data.dropna(axis=0)) #dropping rows that have missing values
print (data['Class'].value_counts())
print(data['Class'].value_counts().plot(kind = 'bar'))
#plt.show()
data['safety'].value_counts().plot(kind = 'bar')
#plt.show()
import seaborn as sns
sns.countplot(data['demand'], hue = data['Class'])
#plt.show()
X = data.drop(['Class'], axis = 1)
y = data['Class']
from sklearn.preprocessing import OrdinalEncoder
demand_category = ['low', 'med', 'high', 'vhigh']
maint_category = ['low', 'med', 'high', 'vhigh']
seats_category = ['2', '3', '4', '5more']
passenger_category = ['2', '4', 'more']
storage_category = ['Nostorage', 'small', 'med']
safety_category = ['poor', 'good', 'vgood']
all_categories = [demand_category, maint_category,seats_category,passenger_category,storage_category,safety_category]
oe = OrdinalEncoder(categories= all_categories)
X = oe.fit_transform( data[['demand','maint', 'seats', 'passenger', 'storage', 'safety']])
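Dataset: https://drive.google.com/file/d/1O0sYZGJep4JkrSgGeJc5e_Nlao2bmegV/view?usp=sharing
With the code above I keep getting 'ValueError: Found unknown categories [nan] in column 2 during fit'. I have tried dropping all missing values. I searched for a fix and found suggestions to use handle_unknown="ignore", but I don't think that applies to ordinal encoding.
I am new to Python, so I would appreciate an in-depth explanation of why this happens and how I can fix it.
PS: This is for preprocessing the data.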
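To explain the error: you did call `dropna`, but you only printed the result. `print(data.dropna(axis=0))` shows a copy with the rows removed; since that copy is never assigned to anything, `data` itself still contains the NaN values.
Judging by your dataset and the error message, the 'seats' column contains a NaN value. It is "column 2" in the error because it is at index 2 (counting from 0) among the columns you pass to the encoder.
If you print `data['seats'].unique()`, you get something like this:
['2' '3' '4' '5more' nan]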
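To make this concrete, here is a minimal sketch of the behaviour. The small DataFrame is a made-up stand-in for Boat.csv (an assumption, not your real data); only the `seats` column has a missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for Boat.csv with one missing value in 'seats'.
data = pd.DataFrame({
    "seats": ["2", "3", np.nan, "5more"],
    "safety": ["poor", "good", "vgood", "good"],
})

# dropna() returns a NEW DataFrame; printing it does not change 'data'.
print(data.dropna(axis=0))

# 'data' still has all four rows, including the one with NaN.
print(len(data))
print(data["seats"].isnull().sum())
print(data["seats"].unique())
```

The last three prints show that the original DataFrame is unchanged after the `dropna` call, which is why the encoder still sees NaN.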
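There are two ways to fix it:
Drop in place:
`data.dropna(inplace=True)`
This modifies the original DataFrame directly instead of returning a new one.
Reassign manually:
`data = data.dropna()`
This produces the same result as `inplace=True`. Any efficiency difference is negligible for a dataset like this, and the explicit assignment is easier to follow. Either way, do it before you create `X` and `y`.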
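Putting it together, the fix might look roughly like the sketch below. It again uses a small made-up DataFrame in place of Boat.csv and only two of your feature columns (both assumptions for illustration); the category lists are the ones from your question.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical stand-in for Boat.csv; the 'Class' labels are invented.
data = pd.DataFrame({
    "seats": ["2", "3", np.nan, "5more", "4"],
    "safety": ["poor", "good", "vgood", "good", "poor"],
    "Class": ["unacc", "acc", "acc", "good", "unacc"],
})

# Keep the result of dropna() so the NaN row is really gone.
data = data.dropna()

seats_category = ["2", "3", "4", "5more"]
safety_category = ["poor", "good", "vgood"]

# One category list per column, in the same order as the columns passed to fit_transform.
oe = OrdinalEncoder(categories=[seats_category, safety_category])
X = oe.fit_transform(data[["seats", "safety"]])

# Take y from the cleaned DataFrame so its rows line up with X.
y = data["Class"]

print(X)
print(y.tolist())
```

Because `y` is taken after the rows are dropped, `X` and `y` have the same number of rows, which matters later for `train_test_split`.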
Hope this answers your question.