Different results with df.dropna(axis='rows') and df.drop(index=np.where(df.isnull().sum()!=0)[0], axis='index')

Question

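I am trying to remove all rows that contain NaN values from a DataFrame. However, I have realized that df.dropna(axis='rows') and df.drop(index=np.where(df.isnull().sum()!=0)[0], axis='index') give different results. The former drops fewer rows than the latter.

For example, my initial DataFrame has 80 columns and 91713 rows.

If I use dropna(), the resulting DataFrame has 80 columns and 91639 rows (i.e. 74 rows were dropped).
If I use drop() instead, the new shape is 80 columns and 56935 rows (i.e. 34778 rows were dropped).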

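Is there something wrong with the way I am feeding the indices into df.drop()? If I just look at the number of indices that are dropped using that method, I do get 74: for example, with df_nulls = df.iloc[np.where(df.isnull().sum()!=0)[0]], df_nulls.shape[0] is 74.

Update: I know something must be wrong with the df.drop() approach, because when I run further processing on the data I get errors related to NaNs still being present. But why doesn't np.where(df.isnull().sum()!=0) find all the NaN values?

Update 2: This must just be a problem with my indexing (see below), but shouldn't iloc give rows?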

indices_rows_with_nulls = np.where(df.isnull().sum()!=0)[0] 
df_nulls = df.iloc[indices_rows_with_nulls] 
print('df.shape: '+ str(df.shape)+'   df_nulls.shape: '+ str(df_nulls.shape))
indices_rows_without_nulls = np.where(df.isnull().sum()==0)[0] 
df_no_nulls = df.iloc[indices_rows_without_nulls]
print('df.shape: '+ str(df.shape)+'   df_no_nulls.shape: '+ str(df_no_nulls.shape))

which gives

df.shape: (91713, 80)   df_nulls.shape: (74, 80)
df.shape: (91713, 80)   df_no_nulls.shape: (6, 80)

Answer 1

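You need to sum across the columns, so that you get one null count per row. By default, sum() uses axis=0 and returns one count per column, so np.where returns column positions rather than row positions (which is why the 74 and 6 in your output add up to 80, the number of columns):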

df.isnull().sum(axis=1)!=0

# or

df.isnull().sum(axis='columns')!=0
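
As a minimal sketch of the difference, the following uses a small made-up DataFrame (the column names and values are assumptions for illustration, not the asker's data). It compares the default per-column sum with the per-row sum, and then drops the affected rows. Since drop() expects index labels rather than positions, the positions from np.where are mapped through df.index first; with a default RangeIndex the two coincide, but that is not guaranteed in general.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 4 rows, 3 columns, NaNs in rows 1 and 3.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [1.0, 2.0, 3.0, np.nan],
    "c": [1.0, 2.0, 3.0, 4.0],
})

# Default axis=0 sums down each column: one null count per COLUMN.
nulls_per_column = df.isnull().sum()

# axis=1 sums across each row: one null count per ROW.
nulls_per_row = df.isnull().sum(axis=1)

# Integer positions of the rows that contain at least one NaN.
row_positions = np.where(nulls_per_row != 0)[0]

# drop() works on index labels, so convert positions to labels.
dropped = df.drop(index=df.index[row_positions])

# With the per-row sum, the result matches dropna on rows.
assert dropped.equals(df.dropna(axis="index"))
```

In practice df.dropna(axis='index') alone is the simpler choice; the np.where route is mainly useful if you also need the positions of the affected rows.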

Tags: python, dataframe, pandas, data-cleaning