Pandas find common NA records across multiple large dataframes

I have 3 dataframes, as shown below:

ID,col1,col2
1,X,35
2,X,37
3,nan,32
4,nan,34
5,X,21
df1 = pd.read_clipboard(sep=',',skipinitialspace=True)

ID,col1,col2
1,nan,305
2,X,307
3,X,302
4,nan,304
5,X,201
df2 = pd.read_clipboard(sep=',',skipinitialspace=True)

ID,col1,col2
1,X,315
2,nan,317
3,X,312
4,nan,314
5,X,21
df3 = pd.read_clipboard(sep=',',skipinitialspace=True)
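
If copying from the clipboard is inconvenient, the same three frames can be built inline. This is only a minimal sketch: the helper name `make_df` is illustrative and not part of the original post, and it relies on `read_csv` treating the literal string `nan` as a missing value by default.

```python
import io
import pandas as pd

def make_df(text):
    # Parse inline CSV text; the string "nan" is read as a missing value by default
    return pd.read_csv(io.StringIO(text), skipinitialspace=True)

df1 = make_df("ID,col1,col2\n1,X,35\n2,X,37\n3,nan,32\n4,nan,34\n5,X,21")
df2 = make_df("ID,col1,col2\n1,nan,305\n2,X,307\n3,X,302\n4,nan,304\n5,X,201")
df3 = make_df("ID,col1,col2\n1,X,315\n2,nan,317\n3,X,312\n4,nan,314\n5,X,21")
```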

Now I want to identify the IDs for which col1 is NA in all 3 input dataframes.

So, I tried the following:

L1=df1[df1['col1'].isna()]['ID'].tolist()
L2=df2[df2['col1'].isna()]['ID'].tolist()
L3=df3[df3['col1'].isna()]['ID'].tolist()
common_ids_all = list(set.intersection(*map(set, [L1,L2,L3])))
final_df = pd.concat([df1,df2,df3],ignore_index=True)
final_df[final_df['ID'].isin(common_ids_all)]

Although the above works, is there a more efficient and elegant way to do this?

As you can see, I repeat the same statement three times (once for each of the 3 dataframes).

However, in my real data I have 12 dataframes, and I need the IDs for which col1 is NA in all 12 of them.

Update: my current read operation looks like this:

fnames = ['file1.xlsx','file2.xlsx', 'file3.xlsx']
dfs=[]
NA_list=[]
def preprocessing(fname):
    df= pd.read_excel(fname, sheet_name="Sheet1")
    df.columns = df.iloc[7]
    df = df.iloc[8: , :]
    NA_list.append(df[df['col1'].isna()]['ID'])
    dfs.append(df)
[preprocessing(fname) for fname in fnames]
final_df = pd.concat(dfs, ignore_index=True)
L1 = NA_list[0]
L2 = NA_list[1]
L3 = NA_list[2]
final_list = (list(set.intersection(*map(set, [L1,L2,L3]))))
final_df[final_df['ID'].isin(final_list)]

This is where a function helps. If the list of dataframes keeps changing, I would write a function. If I understood you correctly, the following should do:

def CombinedNaNs(lst):
    # Keep only the rows where col1 is NA in each dataframe, then stack them
    s = pd.concat([d[d['col1'].isna()] for d in lst])
    # Keep rows whose ID occurs more than once among the NA rows
    return s[s.duplicated(subset=['ID'], keep=False)].drop_duplicates()

dflist = [df1, df2, df3]  # list of dfs

CombinedNaNs(dflist)  # apply function



   ID col1  col2
3   4  NaN    34
3   4  NaN   304
3   4  NaN   314
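
One caveat with the `duplicated` approach: it keeps any ID whose `col1` is NA in two or more dataframes, not necessarily in all of them. The sample data happens to give the same answer either way. Below is a minimal sketch of a stricter variant that uses a set intersection over the list. The function name `ids_na_in_all` and the small frames `a`, `b`, `c` are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

def ids_na_in_all(dfs, col="col1", key="ID"):
    # One set of IDs per dataframe: the rows where `col` is missing
    na_sets = [set(d.loc[d[col].isna(), key]) for d in dfs]
    # IDs present in every set, i.e. NA in all dataframes
    return set.intersection(*na_sets) if na_sets else set()

# Hypothetical data: ID 3 is NA in only two of the three frames
a = pd.DataFrame({"ID": [1, 2, 3, 4], "col1": ["X", "X", None, None]})
b = pd.DataFrame({"ID": [1, 2, 3, 4], "col1": ["X", "X", None, None]})
c = pd.DataFrame({"ID": [1, 2, 3, 4], "col1": ["X", "X", "X", None]})

common = ids_na_in_all([a, b, c])
combined = pd.concat([a, b, c], ignore_index=True)
# Rows for the common IDs only; ID 3 is excluded because it is not NA in `c`
result = combined[combined["ID"].isin(common) & combined["col1"].isna()]
```

Because the function takes a list, the same call should work unchanged for 12 dataframes.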

You can use:

dfs = [df1, df2, df3]
final_df = pd.concat(dfs).query('col1.isna()')
final_df = final_df[final_df.groupby('ID')['ID'].transform('size') == len(dfs)]
print(final_df)

# Output
   ID col1  col2
3   4  NaN    34
3   4  NaN   304
3   4  NaN   314
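
The filter works because, once only the NA rows are kept, an ID that is NA in every dataframe appears exactly `len(dfs)` times. That assumes each ID occurs at most once per dataframe. If that is not guaranteed, a sketch like the following counts distinct source frames instead of rows. The `src` column and the sample data are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical frames; ID 4 appears twice in the first one
dfs = [
    pd.DataFrame({"ID": [3, 4, 4], "col1": [None, None, None], "col2": [32, 34, 35]}),
    pd.DataFrame({"ID": [1, 4], "col1": [None, None], "col2": [305, 304]}),
    pd.DataFrame({"ID": [2, 3, 4], "col1": [None, "X", None], "col2": [317, 312, 314]}),
]

# Tag every row with the position of the frame it came from
stacked = pd.concat([d.assign(src=i) for i, d in enumerate(dfs)])
na_rows = stacked[stacked["col1"].isna()]

# Keep IDs whose NA rows come from every frame, regardless of repeats within a frame
n_sources = na_rows.groupby("ID")["src"].transform("nunique")
final_df = na_rows[n_sources == len(dfs)]
```

With a plain `size` count, ID 4 would be counted four times here and dropped, even though it is NA in all three frames.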

Full code:

import pandas as pd

fnames = ['file1.xlsx','file2.xlsx', 'file3.xlsx']

def preprocessing(fname):
    return pd.read_excel(fname, sheet_name='Sheet1', skiprows=6)

dfs = [preprocessing(fname) for fname in fnames]
final_df = pd.concat([df[df['col1'].isna()] for df in dfs])
final_df = final_df[final_df.groupby('ID')['ID'].transform('size') == len(dfs)]