How to detect a suspicious error in a column of a dataset?

Question

I am trying to encode the data in a dataset named train.csv, which is provided in a GitHub repository. I used the following code.

import pandas as pd 
from sklearn import preprocessing
df = pd.read_csv(r'train.csv',index_col='Id')
df.head()
df['MSSubClass'].fillna(df['MSSubClass'].mean()//1)
df['MSZoning'].fillna(df['MSZoning'].mode())
label_encoder = preprocessing.LabelEncoder() 
for col in df.columns:
    if df[col].dtype == 'O':
        print(df[col])
        df[col] = label_encoder.fit_transform(df[col])
print(df)

While encoding, I got the following output.

MSSubClass
MSZoning
LotFrontage
LotArea
Street
Alley
TypeError: '<' not supported between instances of 'str' and 'float'

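But when I look at the dataset, there is no '<' anywhere in the Alley column. The earlier columns were encoded without problems, but the Alley column causes the error. Please help!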

This is the colab notebook of the code

Answer 1

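The problem is that your missing values are never actually replaced. fillna returns a new object, so its result has to be assigned back, and the replacement has to be done for all columns, not only the first two. Also add .iloc[0] after mode() to select the first row, because mode() can return 2 or more values.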
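To make these points concrete, here is a minimal sketch on an invented four-row frame (the data is made up for illustration and is not taken from train.csv). It shows that an unassigned fillna leaves the frame unchanged, that mode() returns a Series and therefore needs .iloc[0], and where the '<' in the error message presumably comes from: a sort step that compares a string with NaN, which is a float. The plain sorted() call here is only a stand-in for whatever sorting the encoder does internally.

```python
import numpy as np
import pandas as pd

# Invented toy data, standing in for a column such as Alley.
toy = pd.DataFrame({"Alley": ["Grvl", np.nan, "Pave", "Grvl"]})

# 1) fillna returns a new Series; without assignment the frame is unchanged.
toy["Alley"].fillna("Grvl")
missing_after_unassigned_fillna = int(toy["Alley"].isna().sum())

# 2) Sorting a mix of str and NaN (a float) raises the "'<' not supported" error.
try:
    sorted(toy["Alley"].tolist())
    sort_error = None
except TypeError as exc:
    sort_error = str(exc)

# 3) mode() returns a Series (there can be ties), so take the first entry,
#    and assign the filled column back to the frame.
most_common = toy["Alley"].mode().iloc[0]
toy["Alley"] = toy["Alley"].fillna(most_common)
missing_after_assignment = int(toy["Alley"].isna().sum())

print(missing_after_unassigned_fillna, missing_after_assignment)
print(sort_error)
```

So the '<' is not a character in the data; it is the comparison operator that fails on mixed types. The code that follows applies both fixes to the full dataset.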

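import numpy as np
import pandas as pd
from sklearn import preprocessing

df = pd.read_csv(r'train.csv', index_col='Id')
print(df)

# Split the columns into numeric and non-numeric (object) ones.
colsNum = df.select_dtypes(np.number).columns
colsObj = df.columns.difference(colsNum)

# Assign the results back, because fillna returns a new object.
df[colsNum] = df[colsNum].fillna(df[colsNum].mean()//1)
# mode() can return several rows, so take the first one with .iloc[0].
df[colsObj] = df[colsObj].fillna(df[colsObj].mode().iloc[0])

label_encoder = preprocessing.LabelEncoder()
for col in colsObj:
    print(df[col])
    df[col] = label_encoder.fit_transform(df[col])

print(df)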
      MSSubClass  MSZoning  LotFrontage  LotArea  Street  Alley  LotShape  \
Id                                                                          
1             60         3         65.0     8450       1      0         3   
2             20         3         80.0     9600       1      0         3   
3             60         3         68.0    11250       1      0         0   
4             70         3         60.0     9550       1      0         0   
5             60         3         84.0    14260       1      0         0   
         ...       ...          ...      ...     ...    ...       ...   
1456          60         3         62.0     7917       1      0         3   
1457          20         3         85.0    13175       1      0         3   
1458          70         3         66.0     9042       1      0         3   
1459          20         3         68.0     9717       1      0         3   
1460          20         3         75.0     9937       1      0         3   

      LandContour  Utilities  LotConfig  ...  PoolArea  PoolQC  Fence  \
Id                                       ...                            
1               3          0          4  ...         0       2      2   
2               3          0          2  ...         0       2      2   
3               3          0          4  ...         0       2      2   
4               3          0          0  ...         0       2      2   
5               3          0          2  ...         0       2      2   
          ...        ...        ...  ...       ...     ...    ...   
1456            3          0          4  ...         0       2      2   
1457            3          0          4  ...         0       2      2   
1458            3          0          4  ...         0       2      0   
1459            3          0          4  ...         0       2      2   
1460            3          0          4  ...         0       2      2   

      MiscFeature  MiscVal  MoSold  YrSold  SaleType  SaleCondition  SalePrice  
Id                                                                              
1               2        0       2    2008         8              4     208500  
2               2        0       5    2007         8              4     181500  
3               2        0       9    2008         8              4     223500  
4               2        0       2    2006         8              0     140000  
5               2        0      12    2008         8              4     250000  
          ...      ...     ...     ...       ...            ...        ...  
1456            2        0       8    2007         8              4     175000  
1457            2        0       2    2010         8              4     210000  
1458            2     2500       5    2010         8              4     266500  
1459            2        0       4    2010         8              4     142125  
1460            2        0       6    2008         8              4     147500  

[1460 rows x 80 columns]
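For the more general question of how to spot such columns before encoding, one possible approach is to list every non-numeric column that still has missing values or holds more than one Python type. The sketch below is only an illustration: the helper name suspicious_columns and the toy frame are assumptions made for this example, not part of pandas or of the original code.

```python
import numpy as np
import pandas as pd

def suspicious_columns(df):
    """Hypothetical helper: report non-numeric columns with NaN or mixed types."""
    non_numeric = df.columns.difference(df.select_dtypes(np.number).columns)
    rows = []
    for col in non_numeric:
        # Names of the Python types actually stored in the column.
        type_names = sorted({type(v).__name__ for v in df[col]})
        n_missing = int(df[col].isna().sum())
        if n_missing > 0 or len(type_names) > 1:
            rows.append({"column": col, "missing": n_missing, "types": type_names})
    return pd.DataFrame(rows, columns=["column", "missing", "types"])

# Invented toy data: only Alley contains missing values.
toy = pd.DataFrame({
    "Street": ["Pave", "Pave", "Grvl"],
    "Alley": ["Grvl", np.nan, np.nan],
    "LotArea": [8450, 9600, 11250],
})
report = suspicious_columns(toy)
print(report)
```

Run against the real frame before the encoding loop, a report like this would point at Alley and any other column that still needs fillna.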

Tags: preprocessor, dataframe, python-3.x, pandas, sklearn-pandas