How to test multiple pandas columns for a condition at once and update them

Question

I have a DataFrame like this:

import pandas as pd

test = pd.DataFrame({'col1': [10, 20, 30, 40], 'col2': [5, 10, 15, 20], 'col3': [6, 12, 18, 24]})
test

The DataFrame looks like this:

   col1  col2  col3
0    10     5     6
1    20    10    12
2    30    15    18
3    40    20    24

I want to replace the values in col2 or col3 that are greater than 10 with zero, and I would like to use loc to do it.
My desired output is:

   col1  col2  col3
0    10     5     6
1    20    10     0
2    30     0     0
3    40     0     0

I tried the following solution:

cols_to_update = ['col2', 'col3']
test.loc[test[cols_to_update]>10]=0
test

It raises the following error:

KeyError: "None of [Index([('c', 'o', 'l', '1'), ('c', 'o', 'l', '2')], dtype='object')] are in the [index]"

When I test the condition on a single column, no KeyError is raised, but now the values in the other two columns are replaced as well.

test.loc[test['col2']>10]=0
test

The output is:

   col1  col2  col3
0    10     5     6
1    20    10    12
2     0     0     0
3     0     0     0

Can we use loc for this purpose?
Why does loc behave this way?
What is an efficient solution?

Answer 1

I would use numpy.where to conditionally replace values in multiple columns:

import numpy as np

cols_to_update = ['col2', 'col3']
test[cols_to_update] = np.where(test[cols_to_update] > 10, 0, test[cols_to_update])

The expression test[cols_to_update] > 10 gives you a boolean mask:

    col2   col3
0  False  False
1  False   True
2   True   True
3   True   True

np.where then picks the value 0 wherever the mask is True, and the corresponding original value from test[cols_to_update] wherever the mask is False.

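Your solution test.loc[test[cols_to_update]>10]=0 does not work because loc expects a one-dimensional boolean Series here, whereas test[cols_to_update]>10 is still a DataFrame with two columns. This is also why you cannot use loc for this problem (at least not without looping over the columns): the row indices at which col2 and col3 satisfy the condition > 10 are different.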
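For completeness, the "looping over the columns" variant mentioned above could look roughly like the sketch below. It reuses the question's test DataFrame and cols_to_update list; each iteration passes loc a one-dimensional boolean Series plus a single column label, so only that column is modified.

```python
import pandas as pd

test = pd.DataFrame({'col1': [10, 20, 30, 40],
                     'col2': [5, 10, 15, 20],
                     'col3': [6, 12, 18, 24]})
cols_to_update = ['col2', 'col3']

for col in cols_to_update:
    # test[col] > 10 is a 1-D boolean Series; the second argument
    # restricts the assignment to the column being tested.
    test.loc[test[col] > 10, col] = 0

print(test)
```

This works, but it handles one column at a time, which is why the np.where version is the more direct fit when several columns share the same condition.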

So when would loc be appropriate here? For example, if you wanted to set both col2 and col3 to zero whenever either one of them is greater than 10:

test.loc[(test[cols_to_update] > 10).any(axis=1), cols_to_update] = 0
test
# out:
   col1  col2  col3
0    10     5     6
1    20     0     0
2    30     0     0
3    40     0     0

In this case, you index with a one-dimensional Series ((test[cols_to_update] > 10).any(axis=1)), which is an appropriate use case for loc.
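As for why the single-column attempt in the question also changed the other columns: when loc receives only a row mask and no column label, the assignment applies to every column of the selected rows. A minimal sketch of the difference, using two copies (whole_rows and one_col are illustrative names, not from the question):

```python
import pandas as pd

test = pd.DataFrame({'col1': [10, 20, 30, 40],
                     'col2': [5, 10, 15, 20],
                     'col3': [6, 12, 18, 24]})
row_mask = test['col2'] > 10  # 1-D boolean Series, True for rows 2 and 3

# Row mask only: all columns of the matching rows are overwritten.
whole_rows = test.copy()
whole_rows.loc[row_mask] = 0

# Row mask plus a column label: only col2 is overwritten.
one_col = test.copy()
one_col.loc[row_mask, 'col2'] = 0

print(whole_rows)
print(one_col)
```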

Answer 2

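You can use where. Note that DataFrame.where keeps the values where the condition is True and replaces the rest, so the condition is inverted (<= 10):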

import pandas as pd

test = pd.DataFrame({'col1':[10,20,30,40], 'col2':[5,10,15,20], 'col3':[6,12,18,24]})
test[['col2', 'col3']] = test[['col2', 'col3']].where(test[['col2', 'col3']] <= 10, 0)

Output:

   col1  col2  col3
0    10     5     6
1    20    10     0
2    30     0     0
3    40     0     0
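If the inverted condition feels awkward, DataFrame.mask is the counterpart of where: it replaces values where the condition is True, so the original > 10 test can be written directly. This is a small alternative sketch, not part of the original answer; the cols variable is introduced here only for brevity.

```python
import pandas as pd

test = pd.DataFrame({'col1': [10, 20, 30, 40],
                     'col2': [5, 10, 15, 20],
                     'col3': [6, 12, 18, 24]})
cols = ['col2', 'col3']

# mask() replaces entries where the condition is True and keeps the rest.
test[cols] = test[cols].mask(test[cols] > 10, 0)

print(test)
```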
