How to decode column values from rare labels by matching column names

I have two dataframes, as shown below.

import numpy as np
import pandas as pd
from numpy.random import default_rng
rng = default_rng(100)
cdf = pd.DataFrame({'Id':[1,2,3,4,5],
                   'grade': rng.choice(list('ACD'),size=(5)),
                   'dash': rng.choice(list('PQRS'),size=(5)),
                   'dumeel': rng.choice(list('QWER'),size=(5)),
                   'dumma': rng.choice((1234),size=(5)),
                   'target': rng.choice([0,1],size=(5))
})

tdf = pd.DataFrame({'Id': [1,1,1,1,3,3,3],
                   'feature': ['grade=Rare','dash=Q','dumma=rare','dumeel=R','dash=Rare','dumma=rare','grade=D'],
                   'value': [0.2,0.45,-0.32,0.56,1.3,1.5,3.7]})

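My objective is:

a) Replace the Rare or rare values in the feature column of the tdf dataframe with the original values from the cdf dataframe.

b) To identify the original value, we can use the string that comes before =Rare, =rare, = rare, etc. That string is the name of the column in the cdf dataframe where the original value that should replace rare can be found.

I am trying something like the code below, but I am not sure how to proceed from here.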

replace_df = cdf.merge(tdf,how='inner',on='Id')
# incomplete attempt: np.where still needs the replacement values for the rows flagged here
replace_df["replaced_feature"] = np.where((replace_df["feature"].str.contains('rare',regex=True)) & (replace_df["feature"].str.split('=')))

I have to apply this to a large dataset with millions of rows and more than 1000 replacements.

I would like my output to look like the following

# list comprehension to find where rare is in the feature col
tdf['feature'] = [x if y.lower()=='rare' else x+'='+y for x,y in tdf['feature'].str.split('=')]
# create a mask where feature is in columns of cdf
mask = tdf['feature'].isin(cdf.columns)
# use loc to filter your frame and use merge to join cdf on the id and feature column - after you use stack
tdf.loc[mask, 'feature'] = tdf.loc[mask, 'feature']+'='+tdf.loc[mask].merge(cdf.set_index('Id').stack().to_frame(),
                                                                            right_index=True, left_on=['Id', 'feature'])[0].astype(str)

   Id     feature  value
0   1     grade=D   0.20
1   1      dash=Q   0.45
2   1  dumma=1123  -0.32
3   1    dumeel=R   0.56
4   3      dash=P   1.30
5   3   dumma=849   1.50
6   3     grade=D   3.70

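My feeling is that there is no need to look for the Rare values. Extract the column names from tdf and look them up in cdf. After that, flatten your cdf dataframe to extract the right values: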

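# column name before '=' for each tdf row, indexed by Id
r = tdf.set_index('Id')['feature'].str.split('=').str[0].str.lower()

# flatten cdf into a Series keyed by (column, Id), then look up each pair
tdf['feature'] = r.values + '=' + cdf.set_index('Id').unstack() \
                                     .loc[list(zip(r.values, r.index))] \
                                     .astype(str).values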

Output:

>>> tdf
   Id     feature  value
0   1     grade=D   0.20
1   1      dash=Q   0.45
2   1  dumma=1123  -0.32
3   1    dumeel=R   0.56
4   3      dash=P   1.30
5   3   dumma=849   1.50
6   3     grade=A   3.70

>>> r
Id           # <- the index is the row of cdf
1     grade  # <- the values are the column of cdf
1      dash
1     dumma
1    dumeel
3      dash
3     dumma
3     grade
Name: feature, dtype: object
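
To make the flattening step concrete, here is a minimal sketch on a small hand-written frame. The frame demo_cdf and its values are made up for illustration and are not the rng-generated data from the question. It shows that unstack() turns the frame into a Series keyed by (column name, Id), so each (column, Id) pair picks out exactly one cell.

```python
import pandas as pd

# Hypothetical stand-in for cdf with fixed values (not the rng-generated data)
demo_cdf = pd.DataFrame({'Id': [1, 3],
                         'grade': ['D', 'A'],
                         'dumma': [1123, 849]})

# Series with a two-level index: (column name, Id)
flat = demo_cdf.set_index('Id').unstack()

# Look up cells by (column, Id); a list of tuples selects several at once
picked = flat.loc[[('grade', 1), ('dumma', 3)]]
print(picked.astype(str).tolist())
```

Because every row of tdf is looked up this way, a label that is not rare is also overwritten with the value stored in the flattened frame.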

Here is one possible approach, using MultiIndex.map to substitute the values from cdf into tdf:

s = tdf['feature'].str.split('=')
m = s.str[1].isin(['rare', 'Rare'])
v = tdf[m].set_index(['Id', s[m].str[0]]).index.map(cdf.set_index('Id').stack())

tdf.loc[m, 'feature'] = s[m].str[0] + '=' + v.astype(str)

print(tdf)

   Id     feature  value
0   1     grade=D   0.20
1   1      dash=Q   0.45
2   1  dumma=1123  -0.32
3   1    dumeel=R   0.56
4   3      dash=P   1.30
5   3   dumma=849   1.50
6   3     grade=D   3.70
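
For reuse on a larger frame, the same idea can be wrapped in a function. The following is only a sketch under a few assumptions: the name decode_rare, its parameters, and the demo frames are hypothetical, Id values are assumed to be unique in cdf, and the rare check is made case-insensitive and whitespace-tolerant so that labels such as "= rare" are caught as well. It uses unstack() rather than stack() to build the lookup, so the key order is (column, Id).

```python
import pandas as pd

def decode_rare(tdf, cdf, id_col='Id', feature_col='feature'):
    """Return a copy of tdf where 'col=rare' labels carry the value from cdf."""
    out = tdf.copy()
    # Split 'name=value' once; strip whitespace so '= rare' is matched too
    parts = out[feature_col].str.split('=', n=1, expand=True)
    name = parts[0].str.strip()
    is_rare = parts[1].str.strip().str.lower().eq('rare')

    # Lookup Series keyed by (column name, Id); assumes unique Ids in cdf
    lookup = cdf.set_index(id_col).unstack()
    keys = pd.MultiIndex.from_arrays([name[is_rare], out.loc[is_rare, id_col]])
    values = keys.map(lookup).astype(str).to_numpy()

    # Only the rows flagged as rare are rewritten
    out.loc[is_rare, feature_col] = name[is_rare] + '=' + values
    return out

# Hypothetical demo data with fixed values
demo_cdf = pd.DataFrame({'Id': [1, 3],
                         'grade': ['D', 'A'],
                         'dash': ['Q', 'P'],
                         'dumma': [1123, 849]})
demo_tdf = pd.DataFrame({'Id': [1, 1, 1, 3, 3, 3],
                         'feature': ['grade=Rare', 'dash=Q', 'dumma= rare',
                                     'dash=Rare', 'dumma=rare', 'grade=D'],
                         'value': [0.2, 0.45, -0.32, 1.3, 1.5, 3.7]})

result = decode_rare(demo_tdf, demo_cdf)
print(result)
```

Rows whose label is not rare, such as grade=D for Id 3 in the demo data, are left as they are, and the input frame is not modified.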