Access previous dataframe during pandas method chaining
Method chaining is a known way to improve code readability and is often referred to as a Fluent API [1, ]. Pandas does support this approach, as multiple method calls can be chained like this:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
d = {'col1': [1, 2, 3, 4], 'col2': [5, np.nan, 7, 8], 'col3': [9, 10, 11, np.nan], 'col4': [np.nan, np.nan, np.nan, np.nan]}
df = (
    pd
    .DataFrame(d)
    .set_index('col1')
    .drop(labels='col3', axis=1)
)
print(df)
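How can I use method chaining if I need to access attributes of the DataFrame returned by the previous function call? Specifically, I need to call .dropna() on a column subset. As the DataFrame is generated by pd.concat(), the exact column names are not known a priori. Therefore, I am currently using a two-step approach like this: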
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
d_1 = {'col1': [1, 2, 3, 4], 'col2': [5, np.nan, 7, 8], 'col3': [9, 10, 11, np.nan], 'col4': [np.nan, np.nan, np.nan, np.nan]}
d_2 = {'col10': [10, 20, 30, 40], 'col20': [50, np.nan, 70, 80], 'col30': [90, 100, 110, np.nan]}
df_1 = pd.DataFrame(d_1)
df_2 = pd.DataFrame(d_2)
df = pd.concat([df_1, df_2], axis=1)
print(df)
dropped = df.dropna(how='any', subset=[c for c in df.columns if c != 'col4'])
print(dropped)
Is there a more elegant way based on method chaining? .dropna() can certainly be chained, but I did not find a way to access the column names of the DataFrame resulting from the previous pd.concat(). I am thinking of something like:
# pseudo-code
dropped = (
    pd
    .concat([df_1, df_2], axis=1)
    .dropna(how='any', subset=<access columns of dataframe returned from previous concat and ignore desired column>)
)
print(dropped)
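I did not find a way to do this, though. Reassigning the variable in place by using .dropna() with the inplace=True option could improve memory efficiency, but it still would not give the readability of method chaining.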
Use pipe:
dropped = (
    pd
    .concat([df_1, df_2], axis=1)
    .pipe(lambda d: d.dropna(how='any',
                             subset=[c for c in d.columns if c != 'col4']))
)
Output:
   col1  col2  col3  col4  col10  col20  col30
0     1   5.0   9.0   NaN     10   50.0   90.0
2     3   7.0  11.0   NaN     30   70.0  110.0
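If the lambda gets hard to read, the same idea can probably be written with a named function, since pipe forwards extra positional and keyword arguments to the callable. The sketch below is only an illustration of that: the helper name dropna_except and its parameter ignore are made up for this example and are not part of pandas.

```python
import numpy as np
import pandas as pd


def dropna_except(frame, ignore):
    """Drop rows with NaN in any column that is not listed in `ignore`.

    `dropna_except` and `ignore` are hypothetical names for this sketch.
    """
    subset = [c for c in frame.columns if c not in ignore]
    return frame.dropna(how='any', subset=subset)


df_1 = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [5, np.nan, 7, 8],
                     'col3': [9, 10, 11, np.nan], 'col4': [np.nan] * 4})
df_2 = pd.DataFrame({'col10': [10, 20, 30, 40], 'col20': [50, np.nan, 70, 80],
                     'col30': [90, 100, 110, np.nan]})

dropped = (
    pd
    .concat([df_1, df_2], axis=1)
    # pipe calls dropna_except(<result of concat>, ignore=['col4'])
    .pipe(dropna_except, ignore=['col4'])
)
print(dropped.index.tolist())  # [0, 2]
```

This keeps the chain flat and lets the column filter be reused or tested on its own; whether that is worth an extra function depends on how often the pattern shows up.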
Note: an alternative syntax for the dropna call:
lambda d: d.dropna(how='any', subset=d.columns.difference(['col4']))
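As a quick sanity check, the two ways of building the subset should select the same rows. Index.difference returns the remaining labels as an Index (sorted by default where possible), and the order of labels in subset does not affect which rows get dropped. A minimal sketch on a small made-up frame:

```python
import numpy as np
import pandas as pd

# Small illustrative frame; 'col4' is the column to ignore when dropping.
df = pd.DataFrame({'col1': [1, 2, 3, 4],
                   'col2': [5, np.nan, 7, 8],
                   'col4': [np.nan] * 4})

# Variant 1: list comprehension over the column names.
by_comprehension = df.pipe(
    lambda d: d.dropna(how='any', subset=[c for c in d.columns if c != 'col4'])
)

# Variant 2: set difference on the column Index.
by_difference = df.pipe(
    lambda d: d.dropna(how='any', subset=d.columns.difference(['col4']))
)

print(by_comprehension.equals(by_difference))  # True
```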