How to detect a suspicious error in a column of a dataset?
I am trying to encode the data in a dataset named train.csv provided in a github repository. I used the following code.
import pandas as pd
from sklearn import preprocessing
df = pd.read_csv(r'train.csv',index_col='Id')
df.head()
df['MSSubClass'].fillna(df['MSSubClass'].mean()//1)
df['MSZoning'].fillna(df['MSZoning'].mode())
label_encoder = preprocessing.LabelEncoder()
for col in df.columns:
    if df[col].dtype == 'O':
        print(df[col])
        df[col] = label_encoder.fit_transform(df[col])
print(df)
While encoding, the output shows the following.
MSSubClass
MSZoning
LotFrontage
LotArea
Street
Alley
TypeError: '<' not supported between instances of 'str' and 'float'
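But when I look at the dataset, there is no '<' in the Alley column.
Also, the previous columns were encoded, but the Alley column causes the error. Please help me!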
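The problem is that your missing values are not replaced in all columns. The '<' in the message is the comparison operator, not a character in your data: missing values are float NaN, and they cannot be sorted together with strings. fillna returns a new object, so the result has to be assigned back. Also add .iloc[0] to mode to select the first value, in case there are 2 or more modes: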
import numpy as np
import pandas as pd
from sklearn import preprocessing

df = pd.read_csv(r'train.csv',index_col='Id')
print (df)

# split the columns into numeric and non-numeric (object) ones
colsNum = df.select_dtypes(np.number).columns
colsObj = df.columns.difference(colsNum)

# assign the filled values back, fillna does not change df in place
df[colsNum] = df[colsNum].fillna(df[colsNum].mean()//1)
df[colsObj] = df[colsObj].fillna(df[colsObj].mode().iloc[0])

label_encoder = preprocessing.LabelEncoder()
for col in colsObj:
    print(df[col])
    df[col] = label_encoder.fit_transform(df[col])

print (df)
MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape \
Id
1 60 3 65.0 8450 1 0 3
2 20 3 80.0 9600 1 0 3
3 60 3 68.0 11250 1 0 0
4 70 3 60.0 9550 1 0 0
5 60 3 84.0 14260 1 0 0
... ... ... ... ... ... ...
1456 60 3 62.0 7917 1 0 3
1457 20 3 85.0 13175 1 0 3
1458 70 3 66.0 9042 1 0 3
1459 20 3 68.0 9717 1 0 3
1460 20 3 75.0 9937 1 0 3
LandContour Utilities LotConfig ... PoolArea PoolQC Fence \
Id ...
1 3 0 4 ... 0 2 2
2 3 0 2 ... 0 2 2
3 3 0 4 ... 0 2 2
4 3 0 0 ... 0 2 2
5 3 0 2 ... 0 2 2
... ... ... ... ... ... ...
1456 3 0 4 ... 0 2 2
1457 3 0 4 ... 0 2 2
1458 3 0 4 ... 0 2 0
1459 3 0 4 ... 0 2 2
1460 3 0 4 ... 0 2 2
MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
Id
1 2 0 2 2008 8 4 208500
2 2 0 5 2007 8 4 181500
3 2 0 9 2008 8 4 223500
4 2 0 2 2006 8 0 140000
5 2 0 12 2008 8 4 250000
... ... ... ... ... ... ...
1456 2 0 8 2007 8 4 175000
1457 2 0 2 2010 8 4 210000
1458 2 2500 5 2010 8 4 266500
1459 2 0 4 2010 8 4 142125
1460 2 0 6 2008 8 4 147500
[1460 rows x 80 columns]
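To find such columns before encoding, something like the following may help. It is only a minimal sketch on a small made-up DataFrame, not the real train.csv: the values are hypothetical and only the column names are borrowed from the question. It counts missing values per column, shows that sorting strings together with a float NaN raises the same kind of TypeError, and shows that fillna has no effect until the result is assigned back. Whether LabelEncoder itself raises on NaN may depend on the scikit-learn version, so the sketch uses plain sorted to show the comparison problem.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train.csv (values are made up for illustration).
df = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave", "Pave"],
    "Alley": ["Grvl", np.nan, "Pave", np.nan],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Count missing values per column and keep only columns that have any.
missing = df.isna().sum()
missing = missing[missing > 0]
print(missing)

# Sorting text together with a float NaN fails on the '<' comparison.
try:
    sorted(["Grvl", float("nan"), "Pave"])
    sort_error = None
except TypeError as exc:
    sort_error = str(exc)
print(sort_error)

# fillna returns a new object; without assignment the column is unchanged.
df["Alley"].fillna("Grvl")
still_missing = int(df["Alley"].isna().sum())

# Assigning back replaces the NaNs; .iloc[0] picks the first mode on a tie.
df["Alley"] = df["Alley"].fillna(df["Alley"].mode().iloc[0])
after_fill = int(df["Alley"].isna().sum())
print(still_missing, after_fill)
```

On the real data, df.isna().sum() run right after read_csv should list Alley among the columns with missing values, which is why the encoding stops there.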