Replace column value in pandas for all possible combinations of conditions
I have a DataFrame (df) that I would like to transform as described below.
date,string
2002-01-01,ABAA
2002-01-01,AAAA
2002-01-01,CCCC
2002-01-01,BBAA
2002-01-01,AAAA
2002-01-02,BBBB
2002-01-02,BABA
2002-01-02,ABBB
2002-01-02,DCDC
2002-01-02,AABB
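- Find every string that contains both 'D' and 'C', in any order, and change it to 'DDDD'
- Likewise, find every string that contains both 'A' and 'B' and change it to 'AAAA'
As a simple example, the code below produces the expected output shown after it.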
import pandas as pd
import numpy as np
df = pd.read_csv('df.csv')
# Assign the result back instead of calling mask(..., inplace=True) on df["string"]:
# with Copy-on-Write (the default from pandas 3.0) the chained in-place call would not update df.
df["string"] = df["string"].mask(df["string"] == 'DCDC', 'DDDD')
df["string"] = df["string"].mask(df["string"] == 'ABAA', 'AAAA')
df["string"] = df["string"].mask(df["string"] == 'BBAA', 'AAAA')
df["string"] = df["string"].mask(df["string"] == 'BABA', 'AAAA')
df["string"] = df["string"].mask(df["string"] == 'ABBB', 'AAAA')
df["string"] = df["string"].mask(df["string"] == 'AABB', 'AAAA')
print(df)
Expected output:
date,string
2002-01-01,AAAA
2002-01-01,AAAA
2002-01-01,CCCC
2002-01-01,AAAA
2002-01-01,AAAA
2002-01-02,BBBB
2002-01-02,AAAA
2002-01-02,AAAA
2002-01-02,DDDD
2002-01-02,AAAA
However, the code above is far too hard-coded. My idea was to first extract all the rows that need replacing, like this:
letters_dc = ['D','C']
letters_ab = ['A','B']
# The sample data has the columns 'date' and 'string', so select 'string' here.
contains_dc = [df['string'].str.contains(i) for i in letters_dc]
contains_ab = [df['string'].str.contains(i) for i in letters_ab]
result = df[np.all(contains_dc, axis=0) | np.all(contains_ab, axis=0)]
How do I proceed from here, or is there a better way to solve this?
You can use this:
ab = df['string'].str.match(r'^[AB]+$')
cd = df['string'].str.match(r'^[CD]+$')
newdf = df.assign(string=df['string'].where(~ab, 'AAAA').where(~cd, 'DDDD'))
>>> newdf
date string
0 2002-01-01 AAAA
1 2002-01-01 AAAA
2 2002-01-01 DDDD
3 2002-01-01 AAAA
4 2002-01-01 AAAA
5 2002-01-02 AAAA
6 2002-01-02 AAAA
7 2002-01-02 AAAA
8 2002-01-02 DDDD
9 2002-01-02 AAAA
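Any string value that (fully) matches any combination of 'C' and 'D' is replaced with 'DDDD'. Likewise, any combination of 'A' and 'B' becomes 'AAAA'. All other values are left unchanged. Note that this also rewrites 'CCCC' and 'BBBB', which the question's expected output leaves as they are.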
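If a value should only be replaced when both letters of a pair actually occur in it, one possible variant of the regex approach adds a lookahead per letter. This is a sketch, not part of the original answer: the names csv_text and strict_df are made up for illustration, and the sample data is inlined instead of being read from df.csv. Like the original patterns, it still requires the string to consist only of the two letters.

```python
import io
import pandas as pd

# Sample data copied from the question (inlined here instead of reading df.csv).
csv_text = """date,string
2002-01-01,ABAA
2002-01-01,AAAA
2002-01-01,CCCC
2002-01-01,BBAA
2002-01-01,AAAA
2002-01-02,BBBB
2002-01-02,BABA
2002-01-02,ABBB
2002-01-02,DCDC
2002-01-02,AABB
"""
df = pd.read_csv(io.StringIO(csv_text))

# Each lookahead requires one letter to occur at least once;
# the character class still limits the string to those two letters.
ab = df['string'].str.match(r'^(?=.*A)(?=.*B)[AB]+$')
cd = df['string'].str.match(r'^(?=.*C)(?=.*D)[CD]+$')

# mask() replaces the values where the condition is True.
strict_df = df.assign(string=df['string'].mask(ab, 'AAAA').mask(cd, 'DDDD'))
print(strict_df)
```

With the sample data, 'CCCC' and 'BBBB' are kept because each contains only one letter of its pair, while 'DCDC' becomes 'DDDD'.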
You can use numpy.logical_and.reduce:
import numpy as np
letters = [['D','C'], ['A','B']]
for l in letters:
    df['string'] = (df['string']
                    .mask(np.logical_and.reduce(
                        [df['string'].str.contains(x)
                         for x in l]), l[0]*4)
                    )
Output:
date string
0 2002-01-01 AAAA
1 2002-01-01 AAAA
2 2002-01-01 CCCC
3 2002-01-01 AAAA
4 2002-01-01 AAAA
5 2002-01-02 BBBB
6 2002-01-02 AAAA
7 2002-01-02 AAAA
8 2002-01-02 DDDD
9 2002-01-02 AAAA
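The same "contains every letter" test can also be written without a loop that rewrites the column step by step. The sketch below is an alternative, not the answer's own code: it builds one condition per rule and lets numpy.select pick the replacement. The rules dictionary and the name csv_text are hypothetical, introduced only for this example.

```python
import io
import numpy as np
import pandas as pd

# Sample data copied from the question (inlined here instead of reading df.csv).
csv_text = """date,string
2002-01-01,ABAA
2002-01-01,AAAA
2002-01-01,CCCC
2002-01-01,BBAA
2002-01-01,AAAA
2002-01-02,BBBB
2002-01-02,BABA
2002-01-02,ABBB
2002-01-02,DCDC
2002-01-02,AABB
"""
df = pd.read_csv(io.StringIO(csv_text))

# Hypothetical rule table: letters that must all be present -> replacement value.
rules = {('D', 'C'): 'DDDD', ('A', 'B'): 'AAAA'}

# One boolean array per rule: True where the string contains every letter.
conditions = [
    np.logical_and.reduce(
        [df['string'].str.contains(ch, regex=False) for ch in letters]
    )
    for letters in rules
]

# np.select takes the choice of the first condition that is True;
# rows matching no rule keep their original value via `default`.
df['string'] = np.select(conditions, list(rules.values()), default=df['string'])
print(df)
```

All conditions are computed once against the original column, and the first matching rule wins, so the order of the entries in rules decides what happens to a string that satisfies more than one rule.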