Replace column value in pandas for all possible combinations of conditions

I have a DataFrame (df) that I want to manipulate as follows:

date,string
2002-01-01,ABAA
2002-01-01,AAAA
2002-01-01,CCCC
2002-01-01,BBAA
2002-01-01,AAAA
2002-01-02,BBBB
2002-01-02,BABA
2002-01-02,ABBB
2002-01-02,DCDC
2002-01-02,AABB
  1. Find every string that contains both 'D' and 'C', in any order, and change it to 'DDDD'.
  2. Likewise, find every string that contains both 'A' and 'B' and change it to 'AAAA'.

As a simple example, the code below produces the expected output shown underneath.

import pandas as pd 
import numpy as np 
from datetime import datetime

df = pd.read_csv('df.csv')

# Assign the result back: inplace=True on a selected column is chained
# assignment and may not update df in recent pandas versions.
df["string"] = df["string"].mask(df["string"] == 'DCDC', 'DDDD')
df["string"] = df["string"].mask(df["string"] == 'ABAA', 'AAAA')
df["string"] = df["string"].mask(df["string"] == 'BBAA', 'AAAA')
df["string"] = df["string"].mask(df["string"] == 'BABA', 'AAAA')
df["string"] = df["string"].mask(df["string"] == 'ABBB', 'AAAA')
df["string"] = df["string"].mask(df["string"] == 'AABB', 'AAAA')
print(df)

Expected output:

date,string
2002-01-01,AAAA
2002-01-01,AAAA
2002-01-01,CCCC
2002-01-01,AAAA
2002-01-01,AAAA
2002-01-02,BBBB
2002-01-02,AAAA
2002-01-02,AAAA
2002-01-02,DDDD
2002-01-02,AAAA

However, the code above is too hard-coded. My idea was to first extract all the rows that need replacing, like this:

letters_dc = ['D','C']
letters_ab = ['A','B']
# The column in the sample data is named 'string'.
contains_dc = [df['string'].str.contains(i) for i in letters_dc]
contains_ab = [df['string'].str.contains(i) for i in letters_ab]
result = df[np.all(contains_dc, axis=0) | np.all(contains_ab, axis=0)]

How should I proceed from here, or is there a better way to solve this?

You can use this:

ab = df['string'].str.match(r'^[AB]+$')
cd = df['string'].str.match(r'^[CD]+$')

newdf = df.assign(string=df['string'].where(~ab, 'AAAA').where(~cd, 'DDDD'))

>>> newdf
         date string
0  2002-01-01   AAAA
1  2002-01-01   AAAA
2  2002-01-01   DDDD
3  2002-01-01   AAAA
4  2002-01-01   AAAA
5  2002-01-02   AAAA
6  2002-01-02   AAAA
7  2002-01-02   AAAA
8  2002-01-02   DDDD
9  2002-01-02   AAAA

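Any string that consists entirely of 'C' and 'D', in any combination, is replaced with 'DDDD'. Likewise, any string made up only of 'A' and 'B' becomes 'AAAA'. All other values are left unchanged. Note that this also rewrites strings built from a single letter of a group, such as 'CCCC' and 'BBBB', which the expected output in the question leaves as they are.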
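If only strings that contain both letters of a group should be replaced, as in the question's expected output, one possible variant is to test for each letter separately and combine the masks. This is a sketch rather than part of the answer above; the inline sample data (used instead of df.csv) and the names `has_ab` and `has_cd` are assumptions for illustration.

```python
import pandas as pd

# Sample data copied from the question, built inline instead of read from df.csv.
df = pd.DataFrame({
    "date": ["2002-01-01"] * 5 + ["2002-01-02"] * 5,
    "string": ["ABAA", "AAAA", "CCCC", "BBAA", "AAAA",
               "BBBB", "BABA", "ABBB", "DCDC", "AABB"],
})

s = df["string"]
# True only where BOTH letters occur somewhere in the string.
has_ab = s.str.contains("A", regex=False) & s.str.contains("B", regex=False)
has_cd = s.str.contains("C", regex=False) & s.str.contains("D", regex=False)

# mask() replaces the values where the condition is True.
newdf = df.assign(string=s.mask(has_ab, "AAAA").mask(has_cd, "DDDD"))
print(newdf)
```

With this variant 'CCCC' and 'BBBB' stay unchanged. A string containing all four letters would end up as 'DDDD', because the second mask is applied last.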

You can use numpy.logical_and.reduce:

import numpy as np

letters = [['D','C'], ['A','B']]
for l in letters:
    df['string'] = (df['string']
                    .mask(np.logical_and.reduce(
                          [df['string'].str.contains(x)
                           for x in l]), l[0]*4)
                   )

Output:

         date string
0  2002-01-01   AAAA
1  2002-01-01   AAAA
2  2002-01-01   CCCC
3  2002-01-01   AAAA
4  2002-01-01   AAAA
5  2002-01-02   BBBB
6  2002-01-02   AAAA
7  2002-01-02   AAAA
8  2002-01-02   DDDD
9  2002-01-02   AAAA
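The loop above derives each replacement from the first letter of its group (`l[0]*4`). If the replacement values need to be stated explicitly, the same `np.logical_and.reduce` idea could be wrapped in a small helper. This is only a sketch: the function name `replace_by_letters`, its `rules` parameter, and the shortened sample Series are hypothetical and not part of the original answer.

```python
import numpy as np
import pandas as pd


def replace_by_letters(series, rules):
    """Replace each value that contains every letter of a rule's key.

    `rules` (hypothetical) maps a string of required letters to the
    replacement value, e.g. {"DC": "DDDD"}.
    """
    out = series.copy()
    for letters, replacement in rules.items():
        # One boolean mask per letter, AND-ed together across all letters.
        mask = np.logical_and.reduce(
            [out.str.contains(ch, regex=False) for ch in letters]
        )
        out = out.mask(mask, replacement)
    return out


# A few values taken from the question's data.
strings = pd.Series(["ABAA", "AAAA", "CCCC", "DCDC", "BBBB"])
result = replace_by_letters(strings, {"DC": "DDDD", "AB": "AAAA"})
print(result.tolist())
```

Rules are applied in dictionary order, so a value that satisfies several rules ends up with the replacement of the last matching rule that still applies after earlier replacements.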