pandas: return the index of rows having more than one 'NA' value
My code:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
column_names = ["age","workclass","fnlwgt","education","education-num","marital-status","occupation","relationship","race","sex","capital-gain","capital-loss","hrs-per-week","native-country","income"]
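# Raw string avoids the invalid "\s" escape; a regex separator needs the python engine
adult_train = pd.read_csv("adult.data",header=None,sep=r',\s',na_values=["?"],engine='python')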
adult_train.columns=column_names
adult_train.fillna('NA',inplace=True)
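I want the indices of the rows that have the value 'NA' in more than one column. Is there a built-in method, or do I have to iterate row by row and check the value in each column?
Here is a snapshot of the data:
I want the indices of rows such as 398 and 409 (missing values in columns B and G), but not row 394 (a missing value in column N only).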
Use isnull().any(axis=1) or isnull().sum(axis=1) to get a boolean mask, then select the rows with it and take their index, i.e.
import numpy as np
import pandas as pd
df = pd.DataFrame({'A':[1,2,3,4,5],
                   'B':[np.nan,4,5,np.nan,8],
                   'C':[2,4,np.nan,3,5],
                   'D':[np.nan,np.nan,np.nan,np.nan,5]})
   A    B    C    D
0  1  NaN  2.0  NaN
1  2  4.0  4.0  NaN
2  3  5.0  NaN  NaN
3  4  NaN  3.0  NaN
4  5  8.0  5.0  5.0
# Rows with a NaN in column B or column C
df.loc[df[['B','C']].isnull().any(axis=1)].index
Int64Index([0, 2, 3], dtype='int64')
# Rows with more than one NaN
df.loc[df.isnull().sum(axis=1)>1].index
Int64Index([0, 2, 3], dtype='int64')
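One caveat for the code in the question: isnull() only detects real missing values (NaN/None), so it will not see the string 'NA' once fillna('NA') has run. The sketch below shows two ways around that, under the assumption that 'NA' is used only as the placeholder and never as a legitimate value. The frame toy and its contents are made up for illustration and only borrow three column names from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for adult_train (values are invented)
toy = pd.DataFrame({
    "workclass": ["Private", np.nan, np.nan, "State-gov"],
    "occupation": ["Sales", np.nan, "Tech-support", "Adm-clerical"],
    "native-country": [np.nan, "Cuba", np.nan, "India"],
})

# Option 1: count real NaN values per row BEFORE calling fillna
nan_counts = toy.isnull().sum(axis=1)
rows_before_fill = toy.index[nan_counts > 1].tolist()

# Option 2: after fillna('NA'), count the placeholder string instead
filled = toy.fillna("NA")
placeholder_counts = (filled == "NA").sum(axis=1)
rows_after_fill = filled.index[placeholder_counts > 1].tolist()

print(rows_before_fill)  # [1, 2]
print(rows_after_fill)   # [1, 2]
```

Both options return the same row labels here. Option 1 is probably the safer default, since it cannot confuse a genuine 'NA' string in the data with a missing value.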