How do I select and replace the main keyword in a pandas string column?

Here is my data:

Id  Keyword
1   ayam e-commerce
2   biaya fuel personal wallet
3   pulsa sms virtualaccount
4   biaya koperasi personal
5   familymart personal
6   e-commerce pln
7   biaya onus
8   koperasi personal
9   biaya familymart personal
10  fuel personal wallet
11  fuel travel

I want every value that contains one of the keywords fuel, pln, or ayam to be shortened to just that keyword, so the output would look like this:

Id  Keyword
1   ayam
2   fuel
3   pulsa sms virtualaccount
4   biaya koperasi personal
5   familymart personal
6   pln
7   biaya onus
8   koperasi personal
9   biaya familymart personal
10  fuel
11  fuel

How can I do this?

To replace each value with only the first matched keyword, use str.contains in a loop:

L = ['fuel', 'pln', 'ayam']
for x in L:
    df.loc[df['Keyword'].str.contains(x), 'Keyword'] = x
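
One caveat with the loop: str.contains treats its argument as a regular expression and matches substrings, so a keyword like pln would also match inside a longer word. Below is a minimal sketch of a whole-word variant, assuming that is the behaviour you want. The three-row frame is made up for illustration, and the "kompln" row is hypothetical.

```python
import re
import pandas as pd

# Small sample frame; the 'kompln personal' row is invented to show the substring issue.
df = pd.DataFrame({
    "Id": [1, 2, 3],
    "Keyword": ["ayam e-commerce", "biaya fuel personal wallet", "kompln personal"],
})

L = ["fuel", "pln", "ayam"]
for x in L:
    # \b anchors the keyword at word boundaries; re.escape protects regex metacharacters
    mask = df["Keyword"].str.contains(r"\b{}\b".format(re.escape(x)), regex=True)
    df.loc[mask, "Keyword"] = x

print(df)
```

With the plain `str.contains(x)` from the loop above, the third row would have been rewritten to pln; with the word boundaries it is left alone.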

Or use a nested list comprehension:

L = ['fuel', 'pln', 'ayam']
df['Keyword'] = [next(iter([z for z in L if z in x]), x) for x in df['Keyword']]

Or use str.extract with fillna to replace the missing (unmatched) values with the original values:

L = ['fuel', 'pln', 'ayam']
pat = '|'.join(r"\b{}\b".format(x) for x in L)
df['Keyword'] = df['Keyword'].str.extract('('+ pat + ')', expand=False).fillna(df['Keyword'])


print (df)
    Id                    Keyword
0    1                       ayam
1    2                       fuel
2    3   pulsa sms virtualaccount
3    4    biaya koperasi personal
4    5        familymart personal
5    6                        pln
6    7                 biaya onus
7    8          koperasi personal
8    9  biaya familymart personal
9   10                       fuel
10  11                       fuel
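
The three approaches agree on the sample data, but they can differ when a row contains more than one keyword. A small sketch of the difference, using a hypothetical value "pln fuel" that is not in the question's data:

```python
import pandas as pd

L = ["fuel", "pln", "ayam"]
s = pd.Series(["pln fuel"])  # hypothetical row containing two keywords

# Loop with str.contains: the keyword that comes first in L wins
loop = s.copy()
for x in L:
    loop.loc[loop.str.contains(x)] = x

# List comprehension: also follows the order of L
comp = [next(iter([z for z in L if z in x]), x) for x in s]

# str.extract: the leftmost match in the string wins
pat = "|".join(r"\b{}\b".format(x) for x in L)
ext = s.str.extract("(" + pat + ")", expand=False).fillna(s)

print(loop.tolist(), comp, ext.tolist())
```

So if the order of L encodes a priority, the loop or the list comprehension is probably the better fit; if the first keyword in the text should win, str.extract is.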

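If you need all matched values, use str.findall with str.join, then use loc to overwrite only the rows where the joined result is non-empty, so unmatched rows keep their original values: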

print (df)
   Id                   Keyword
0   1           ayam e-commerce
1   2     biaya fuel pln wallet <- matched 2 keywords
2   3  pulsa sms virtualaccount

pat = '|'.join(r"\b{}\b".format(x) for x in L)
s = df['Keyword'].str.findall('('+ pat + ')').str.join(', ')
df.loc[s != '', 'Keyword'] = s
print (df)
   Id                   Keyword
0   1                      ayam
1   2                 fuel, pln
2   3  pulsa sms virtualaccount
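
For reference, here is a self-contained version of the findall approach that can be run as is. The DataFrame construction is my assumption about how df was built, and Series.where is swapped in for the loc assignment as an equivalent way to keep unmatched rows.

```python
import pandas as pd

df = pd.DataFrame({
    "Id": [1, 2, 3],
    "Keyword": ["ayam e-commerce", "biaya fuel pln wallet", "pulsa sms virtualaccount"],
})

L = ["fuel", "pln", "ayam"]
pat = "|".join(r"\b{}\b".format(x) for x in L)

# findall returns a list of matches per row; rows without a match give an empty list,
# which joins to an empty string
s = df["Keyword"].str.findall("(" + pat + ")").str.join(", ")

# keep the joined keywords where something matched, otherwise keep the original text
df["Keyword"] = s.where(s != "", df["Keyword"])
print(df)
```

The matches are joined in the order they appear in each string, not in the order of L.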