Select only those rows from a DataFrame where certain columns with a suffix have values not equal to zero

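I want to select only those rows of a DataFrame where at least one of the columns with a certain suffix has a value not equal to zero. The real data has many more columns, so I need a generic solution.

For example: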

import pandas as pd
data = {
    'ID' : [1,2,3,4,5],
    'M_NEW':[10,12,14,16,18],
    'M_OLD':[10,12,14,16,18],
    'M_DIFF':[0,0,0,0,0],
    'CA_NEW':[10,12,16,16,18],
    'CA_OLD':[10,12,14,16,18],
    'CA_DIFF':[0,0,2,0,0],
    'BC_NEW':[10,12,14,16,18],
    'BC_OLD':[10,12,14,16,17],
    'BC_DIFF':[0,0,0,0,1]
}
df = pd.DataFrame(data)
df

The DataFrame will be:

   ID  M_NEW  M_OLD  M_DIFF  CA_NEW  CA_OLD  CA_DIFF  BC_NEW  BC_OLD  BC_DIFF
0   1     10     10       0      10      10        0      10      10        0
1   2     12     12       0      12      12        0      12      12        0
2   3     14     14       0      16      14        2      14      14        0
3   4     16     16       0      16      16        0      16      16        0
4   5     18     18       0      18      18        0      18      17        1

The desired output is (because of the 2 in CA_DIFF and the 1 in BC_DIFF):

   ID  M_NEW  M_OLD  M_DIFF  CA_NEW  CA_OLD  CA_DIFF  BC_NEW  BC_OLD  BC_DIFF
0   3     14     14       0      16      14        2      14      14        0
1   5     18     18       0      18      18        0      18      17        1

Combining multiple conditions works, but what if there are many more DIFF columns, say 20? Can someone provide a generic solution? Thanks.

You can do it like this:


...
# get all columns whose name ends with _DIFF
columns = df.columns[df.columns.str.endswith('_DIFF')]

# keep rows where any of those columns is nonzero
# (ne(0) also catches negative differences, which x > 0 would miss)
df[df[columns].ne(0).any(axis=1)]
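For reference, here is a self-contained sketch of the same idea using `DataFrame.filter` with a regex anchored at the end of the column name. The frame is a trimmed, hypothetical version of the one in the question, with one difference made negative to show that a "not equal to zero" test keeps it. The `reset_index(drop=True)` call is only there to renumber the rows from 0, as in the desired output.

```python
import pandas as pd

# Hypothetical, trimmed sample (assumption: one negative difference added)
df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'M_DIFF': [0, 0, 0, 0, 0],
    'CA_DIFF': [0, 0, 2, 0, 0],
    'BC_DIFF': [0, 0, 0, 0, -1],
})

# Select every column whose name ends with _DIFF, however many there are
diff_cols = df.filter(regex=r'_DIFF$')

# True for rows where at least one _DIFF column is nonzero
mask = diff_cols.ne(0).any(axis=1)

# Keep those rows and renumber the index from 0
result = df[mask].reset_index(drop=True)
print(result)
```

Because the column selection is driven by the name pattern, the same three lines should work unchanged for 3 or 20 `_DIFF` columns.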

You can use the function below, combined with pipe, to filter rows based on various conditions:

In [22]: def filter_rows(df, dtype, columns, condition, any_True = True):
    ...:     temp = df.copy()
    ...:     if dtype:
    ...:         temp = df.select_dtypes(dtype)
    ...:     if columns:
    ...:         booleans = temp.loc[:, columns].transform(condition)
    ...:     else:
    ...:         booleans = temp.transform(condition)
    ...:     if any_True:
    ...:         booleans = booleans.any(axis = 1)
    ...:     else:
    ...:         booleans = booleans.all(axis = 1)
    ...: 
    ...:     return df.loc[booleans]

In [24]: df.pipe(filter_rows,
                 dtype=None, 
                 columns=lambda df: df.columns.str.endswith("_DIFF"),
                 condition= lambda df: df.ne(0)
                 )

Out[24]: 
   ID  M_NEW  M_OLD  M_DIFF  CA_NEW  CA_OLD  CA_DIFF  BC_NEW  BC_OLD  BC_DIFF
2   3     14     14       0      16      14        2      14      14        0
4   5     18     18       0      18      18        0      18      17        1
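The `any_True` flag in the function above switches between `any(axis=1)` and `all(axis=1)`. A minimal sketch of the difference, written without the helper so it runs on its own; the frame here is hypothetical and is not the data from the question.

```python
import pandas as pd

# Hypothetical frame chosen so that "any" and "all" give different answers
df = pd.DataFrame({
    'ID': [1, 2, 3],
    'M_DIFF': [0, 1, 1],
    'CA_DIFF': [0, 0, 2],
    'BC_DIFF': [0, 3, 1],
})

# Boolean frame: True where a _DIFF column is nonzero
nonzero = df.loc[:, df.columns.str.endswith('_DIFF')].ne(0)

# Rows with at least one nonzero _DIFF value (corresponds to any_True=True)
any_rows = df.loc[nonzero.any(axis=1)]

# Rows where every _DIFF value is nonzero (corresponds to any_True=False)
all_rows = df.loc[nonzero.all(axis=1)]

print(any_rows)
print(all_rows)
```

For the question as asked, the `any` variant is the one that matches the desired output.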