pandas dataframe masks to write values into new column

Question

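I created several masks on a pandas DataFrame in order to build a new column that should be filled from different columns, depending on a condition.

The (simplified) code looks like this: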

mask0 = (df['condition'] == 1)
mask1 = (df['condition'] == 0)

df.loc[mask0, 'newColumn'] = df['otherColumn1']
df.loc[mask1, 'newColumn'] = df['otherColumn2']

However, executing the third line raises the following error:

ValueError: cannot reindex from a duplicate axis

If I just assign a constant instead, it works:

df.loc[mask0, 'newColumn'] = 1

What am I doing wrong?

Answer 1

You need to mask the "data provider" (the right-hand side) as well:

df.loc[mask0, 'newColumn'] = df.loc[mask0, 'otherColumn1']
df.loc[mask1, 'newColumn'] = df.loc[mask1, 'otherColumn2']

If the first condition is true exactly when the second one is false, and vice versa, we can use np.where(..):

df['newColumn'] = np.where(mask0, df['otherColumn1'], df['otherColumn2'])
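To make the np.where(..) variant concrete, here is a minimal sketch on made-up data. The sample values are assumptions for illustration only; the column names follow the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, not taken from the question
df = pd.DataFrame({
    'condition': [1, 0, 1, 0],
    'otherColumn1': [10, 20, 30, 40],
    'otherColumn2': [100, 200, 300, 400],
})

mask0 = df['condition'] == 1

# Rows where mask0 is True take otherColumn1, all other rows take otherColumn2
df['newColumn'] = np.where(mask0, df['otherColumn1'], df['otherColumn2'])

print(df)
```

Note that np.where(..) works position by position on the underlying values, so this only fits when every row falls under one of the two cases.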

Or you can use np.select(..) in case both conditions can be False. If both are False, we keep the old value:

df['newColumn'] = np.select(
    [mask0, mask1],
    [df['otherColumn1'], df['otherColumn2']],
    default=df['newColumn']
)

Here we of course assume that newColumn already exists in the dataframe (for example, from some earlier processing).
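As a rough illustration of the fallback behaviour, the sketch below adds a row whose condition matches neither mask. The data, including the pre-existing newColumn filled with -1, is assumed purely for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the last row has condition == 2 and matches neither mask
df = pd.DataFrame({
    'condition': [1, 0, 2],
    'otherColumn1': [10, 20, 30],
    'otherColumn2': [100, 200, 300],
    'newColumn': [-1, -1, -1],  # assumed to exist from earlier processing
})

mask0 = df['condition'] == 1
mask1 = df['condition'] == 0

# The first matching condition wins; rows matching none keep their old newColumn value
df['newColumn'] = np.select(
    [mask0, mask1],
    [df['otherColumn1'], df['otherColumn2']],
    default=df['newColumn'],
)

print(df)
```

In this sketch the third row should keep its placeholder value, since neither mask selects it.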

Answer 2

You have to filter on both sides:

mask0 = (df['condition'] == 1)
mask1 = (df['condition'] == 0)

df.loc[mask0, 'newColumn'] = df.loc[mask0, 'otherColumn1']
df.loc[mask1, 'newColumn'] = df.loc[mask1, 'otherColumn2']
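For anyone who wants to try this end to end, the following is a small self-contained sketch of filtering both sides. The sample values are invented for illustration, and the DataFrame here has a plain unique index.

```python
import pandas as pd

# Hypothetical sample data; the column names follow the question
df = pd.DataFrame({
    'condition': [1, 0, 1, 0],
    'otherColumn1': [10, 20, 30, 40],
    'otherColumn2': [100, 200, 300, 400],
})

mask0 = df['condition'] == 1
mask1 = df['condition'] == 0

# Select the same rows on the left-hand and right-hand side of each assignment
df.loc[mask0, 'newColumn'] = df.loc[mask0, 'otherColumn1']
df.loc[mask1, 'newColumn'] = df.loc[mask1, 'otherColumn2']

print(df)
```

After both assignments every row of newColumn has been filled from one of the two source columns.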

But here it is better to use numpy.select to avoid repeating code:

df['newColumn'] = np.select([mask0, mask1], 
                            [df['otherColumn1'], df['otherColumn2']], 
                            default=np.nan)

Answer 3

Another solution using np.where:

df['newColumn'] = np.where(df['condition'].eq(1), df['otherColumn1'], df['condition'])
# Fall back to the value written above so the first assignment is not overwritten
df['newColumn'] = np.where(df['condition'].eq(0), df['otherColumn2'], df['newColumn'])
