Select only those rows from a DataFrame where certain columns with a suffix have values not equal to zero
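I want to select only those rows of a DataFrame where certain columns with a given suffix have values not equal to zero. The number of such columns is large, so I need a generalized solution.
For example: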
import pandas as pd
data = {
    'ID': [1, 2, 3, 4, 5],
    'M_NEW': [10, 12, 14, 16, 18],
    'M_OLD': [10, 12, 14, 16, 18],
    'M_DIFF': [0, 0, 0, 0, 0],
    'CA_NEW': [10, 12, 16, 16, 18],
    'CA_OLD': [10, 12, 14, 16, 18],
    'CA_DIFF': [0, 0, 2, 0, 0],
    'BC_NEW': [10, 12, 14, 16, 18],
    'BC_OLD': [10, 12, 14, 16, 17],
    'BC_DIFF': [0, 0, 0, 0, 1]
}
df = pd.DataFrame(data)
df
The DataFrame will be:
   ID  M_NEW  M_OLD  M_DIFF  CA_NEW  CA_OLD  CA_DIFF  BC_NEW  BC_OLD  BC_DIFF
0   1     10     10       0      10      10        0      10      10        0
1   2     12     12       0      12      12        0      12      12        0
2   3     14     14       0      16      14        2      14      14        0
3   4     16     16       0      16      16        0      16      16        0
4   5     18     18       0      18      18        0      18      17        1
The desired output is (because of the 2 in CA_DIFF and the 1 in BC_DIFF):
   ID  M_NEW  M_OLD  M_DIFF  CA_NEW  CA_OLD  CA_DIFF  BC_NEW  BC_OLD  BC_DIFF
0   3     14     14       0      16      14        2      14      14        0
1   5     18     18       0      18      18        0      18      17        1
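This works using multiple conditions, but what if there are more DIFF columns, say 20? Can someone provide a general solution? Thanks.
You can do it like this: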
...
# get all columns whose name ends with the _DIFF suffix
columns = df.columns[df.columns.str.endswith('_DIFF')]
# keep rows where any of those columns is not equal to 0
# (ne(0) also catches negative differences, which `> 0` would miss)
df[df[columns].ne(0).any(axis=1)]
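To check how suffix-based selection behaves when a difference is negative, here is a minimal self-contained sketch. The helper name `rows_with_nonzero_suffix` and the `-1` entry in `BC_DIFF` are hypothetical, added only for illustration; they are not part of the question's data.

```python
import pandas as pd


def rows_with_nonzero_suffix(df, suffix="_DIFF"):
    """Return rows where at least one column ending in `suffix` is non-zero."""
    # Pick every column whose name ends with the suffix.
    cols = df.columns[df.columns.str.endswith(suffix)]
    # ne(0) flags positive and negative values alike; any(axis=1) reduces per row.
    mask = df[cols].ne(0).any(axis=1)
    return df[mask]


# Reduced version of the question's data; the -1 is a hypothetical negative diff.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "M_DIFF": [0, 0, 0, 0, 0],
    "CA_DIFF": [0, 0, 2, 0, 0],
    "BC_DIFF": [0, -1, 0, 0, 1],
})

result = rows_with_nonzero_suffix(df)
print(result["ID"].tolist())
```

Because the mask is built from whatever columns match the suffix, the same call should work unchanged with 3 or 20 DIFF columns.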
You can use the function below, combined with pipe, to filter rows based on various conditions:
In [22]: def filter_rows(df, dtype, columns, condition, any_True=True):
    ...:     temp = df.copy()
    ...:     if dtype:
    ...:         temp = df.select_dtypes(dtype)
    ...:     if columns:
    ...:         booleans = temp.loc[:, columns].transform(condition)
    ...:     else:
    ...:         booleans = temp.transform(condition)
    ...:     if any_True:
    ...:         booleans = booleans.any(axis=1)
    ...:     else:
    ...:         booleans = booleans.all(axis=1)
    ...:
    ...:     return df.loc[booleans]

In [24]: df.pipe(filter_rows,
    ...:         dtype=None,
    ...:         columns=lambda df: df.columns.str.endswith("_DIFF"),
    ...:         condition=lambda df: df.ne(0)
    ...:         )
Out[24]:
   ID  M_NEW  M_OLD  M_DIFF  CA_NEW  CA_OLD  CA_DIFF  BC_NEW  BC_OLD  BC_DIFF
2   3     14     14       0      16      14        2      14      14        0
4   5     18     18       0      18      18        0      18      17        1
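The output above keeps the original row labels (2 and 4), whereas the desired output in the question is numbered 0 and 1. A possible variant, sketched here under the assumption that a fresh index is wanted, selects the suffix columns with `DataFrame.filter(regex=...)` and then calls `reset_index(drop=True)`. This is an alternative illustration, not the approach used in the answers above.

```python
import pandas as pd

data = {
    'ID': [1, 2, 3, 4, 5],
    'M_NEW': [10, 12, 14, 16, 18],
    'M_OLD': [10, 12, 14, 16, 18],
    'M_DIFF': [0, 0, 0, 0, 0],
    'CA_NEW': [10, 12, 16, 16, 18],
    'CA_OLD': [10, 12, 14, 16, 18],
    'CA_DIFF': [0, 0, 2, 0, 0],
    'BC_NEW': [10, 12, 14, 16, 18],
    'BC_OLD': [10, 12, 14, 16, 17],
    'BC_DIFF': [0, 0, 0, 0, 1],
}
df = pd.DataFrame(data)

# Keep only the columns whose name ends in _DIFF ($ anchors the match at the end).
diff_cols = df.filter(regex=r"_DIFF$")

# Rows with at least one non-zero difference, renumbered from 0.
result = df[diff_cols.ne(0).any(axis=1)].reset_index(drop=True)
print(result)
```

Dropping the `reset_index` call leaves the original labels in place, which may be preferable if the rows need to be matched back to the source DataFrame later.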