Access previous dataframe during pandas method chaining

Method chaining is a well-known way to improve code readability and is often referred to as a fluent API [1, ]. Pandas does support this approach, as multiple method calls can be chained like this:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import numpy as np
import pandas as pd


d = {'col1': [1, 2, 3, 4], 'col2': [5, np.nan, 7, 8], 'col3': [9, 10, 11, np.nan], 'col4': [np.nan, np.nan, np.nan, np.nan]}

df = (
    pd
    .DataFrame(d)
    .set_index('col1')
    .drop(labels='col3', axis=1)
)

print(df)

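How can I use method chaining if I need to access attributes of the DataFrame returned by the previous function call? Specifically, I need to call .dropna() on a subset of columns. Because the DataFrame is generated by pd.concat(), the exact column names are not known a priori. I am therefore currently using a two-step approach like this: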

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import numpy as np
import pandas as pd

d_1 = {'col1': [1, 2, 3, 4], 'col2': [5, np.nan, 7, 8], 'col3': [9, 10, 11, np.nan], 'col4': [np.nan, np.nan, np.nan, np.nan]}
d_2 = {'col10': [10, 20, 30, 40], 'col20': [50, np.nan, 70, 80], 'col30': [90, 100, 110, np.nan]}

df_1 = pd.DataFrame(d_1)
df_2 = pd.DataFrame(d_2)

df = pd.concat([df_1, df_2], axis=1)
print(df)

dropped = df.dropna(how='any', subset=[c for c in df.columns if c != 'col4'])
print(dropped)

Is there a more elegant way based on method chaining? .dropna() can certainly be chained, but I did not find a way to access the column names of the DataFrame resulting from the preceding pd.concat(). I imagine something like

# pseudo-code
dropped = (
    pd
    .concat([df_1, df_2], axis=1)
    .dropna(how='any', subset=<access columns of dataframe returned from previous concat and ignore desired column>)
)
print(dropped)

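but I did not find a solution. Reassigning the variable in place with the inplace=True option of .dropna() could improve memory efficiency. However, that still does not improve readability in terms of method chaining.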

Use pipe:

dropped = (
    pd
    .concat([df_1, df_2], axis=1)
    .pipe(lambda d: d.dropna(how='any',
                             subset=[c for c in d.columns if c != 'col4']))
)

Output:

   col1  col2  col3  col4  col10  col20  col30
0     1   5.0   9.0   NaN     10   50.0   90.0
2     3   7.0  11.0   NaN     30   70.0  110.0
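
For anyone who wants to check this, here is a self-contained sketch that rebuilds df_1 and df_2 from the question and compares the piped version against the two-step version. The names two_step and chained are only illustrative.

```python
import numpy as np
import pandas as pd

d_1 = {'col1': [1, 2, 3, 4], 'col2': [5, np.nan, 7, 8], 'col3': [9, 10, 11, np.nan], 'col4': [np.nan, np.nan, np.nan, np.nan]}
d_2 = {'col10': [10, 20, 30, 40], 'col20': [50, np.nan, 70, 80], 'col30': [90, 100, 110, np.nan]}

df_1 = pd.DataFrame(d_1)
df_2 = pd.DataFrame(d_2)

# Two-step version from the question: concat first, then dropna on a computed subset.
df = pd.concat([df_1, df_2], axis=1)
two_step = df.dropna(how='any', subset=[c for c in df.columns if c != 'col4'])

# Chained version: pipe hands the intermediate DataFrame to the lambda as `d`.
chained = (
    pd
    .concat([df_1, df_2], axis=1)
    .pipe(lambda d: d.dropna(how='any',
                             subset=[c for c in d.columns if c != 'col4']))
)

# Both approaches should yield the same rows (index 0 and 2).
print(chained.equals(two_step))
```

The point of pipe here is that the lambda receives the result of the previous step, so its columns can be inspected without binding it to a name first.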

NB. An alternative syntax for the dropna call:

lambda d: d.dropna(how='any', subset=d.columns.difference(['col4']))
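
If the lambda gets hard to read, the same idea can be written with a small named function, since pipe forwards extra positional and keyword arguments to the callable. This is only a sketch: dropna_except and its ignore parameter are hypothetical names made up for illustration, not part of pandas.

```python
import numpy as np
import pandas as pd


def dropna_except(frame, ignore, how='any'):
    """Drop rows with NaN in any column that is not listed in `ignore`.

    Hypothetical helper for illustration; not a pandas function.
    """
    subset = [c for c in frame.columns if c not in ignore]
    return frame.dropna(how=how, subset=subset)


df_1 = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [5, np.nan, 7, 8],
                     'col3': [9, 10, 11, np.nan], 'col4': [np.nan] * 4})
df_2 = pd.DataFrame({'col10': [10, 20, 30, 40], 'col20': [50, np.nan, 70, 80],
                     'col30': [90, 100, 110, np.nan]})

dropped = (
    pd
    .concat([df_1, df_2], axis=1)
    # pipe calls dropna_except(<intermediate frame>, ignore=['col4'])
    .pipe(dropna_except, ignore=['col4'])
)
print(dropped)
```

A named function keeps the chain short and makes the column filter reusable and testable on its own, at the cost of one extra definition.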