How do I select and replace the main keyword in a pandas string column?
This is my data:
Id Keyword
1 ayam e-commerce
2 biaya fuel personal wallet
3 pulsa sms virtualaccount
4 biaya koperasi personal
5 familymart personal
6 e-commerce pln
7 biaya onus
8 koperasi personal
9 biaya familymart personal
10 fuel personal wallet
11 fuel travel
I want every entry that contains a main keyword such as fuel, pln, or ayam to be shortened to just fuel, pln, or ayam, so the output would look like this:
Id Keyword
1 ayam
2 biaya fuel personal wallet
3 pulsa sms virtualaccount
4 biaya koperasi personal
5 familymart personal
6 pln
7 biaya onus
8 koperasi personal
9 biaya familymart personal
10 fuel
11 fuel
How can I do that?
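To replace each value with only the first matched word, use contains in a loop:
L = ['fuel', 'pln', 'ayam']
for x in L:
    # overwrite every row whose text contains the keyword with the keyword itself
    df.loc[df['Keyword'].str.contains(x), 'Keyword'] = x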
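For reference, here is a self-contained sketch of the loop on the sample data from the question. The DataFrame construction and the regex=False argument are my additions (regex=False just makes contains treat each keyword as a literal substring). Because each pass overwrites the cell, a row that contains several keywords ends up with whichever one comes first in L, not the one that appears first in the text.

```python
import pandas as pd

# Sample data rebuilt from the question (assumed layout: Id + Keyword columns)
df = pd.DataFrame({
    "Id": range(1, 12),
    "Keyword": [
        "ayam e-commerce",
        "biaya fuel personal wallet",
        "pulsa sms virtualaccount",
        "biaya koperasi personal",
        "familymart personal",
        "e-commerce pln",
        "biaya onus",
        "koperasi personal",
        "biaya familymart personal",
        "fuel personal wallet",
        "fuel travel",
    ],
})

L = ["fuel", "pln", "ayam"]
for x in L:
    # literal substring test; rows that match are overwritten with the keyword
    mask = df["Keyword"].str.contains(x, regex=False)
    df.loc[mask, "Keyword"] = x

print(df)
```

Note that this is plain substring matching, so a keyword would also match inside a longer word.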
Or a nested list comprehension:
L = ['fuel', 'pln', 'ayam']
df['Keyword'] = [next(iter([z for z in L if z in x]), x) for x in df['Keyword']]
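The same idea can be sketched without pandas, which may make the next(...) trick easier to read. The helper name shorten is hypothetical, and I use a generator expression instead of iter([...]); the behaviour should be the same: return the first keyword (in list order) found as a substring, otherwise return the text unchanged.

```python
def shorten(text, keywords):
    """Return the first keyword contained in text, or text itself if none match."""
    # next() takes the first hit from the generator, falling back to the original text
    return next((k for k in keywords if k in text), text)


L = ["fuel", "pln", "ayam"]
samples = ["ayam e-commerce", "biaya fuel personal wallet", "biaya onus"]
result = [shorten(s, L) for s in samples]
print(result)
```

With a DataFrame this would be applied as df['Keyword'] = [shorten(x, L) for x in df['Keyword']].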
Or use extract with fillna to replace the missing values (rows with no match) with the original values:
L = ['fuel', 'pln', 'ayam']
pat = '|'.join(r"\b{}\b".format(x) for x in L)
df['Keyword'] = df['Keyword'].str.extract('('+ pat + ')', expand=False).fillna(df['Keyword'])
print (df)
Id Keyword
0 1 ayam
1 2 fuel
2 3 pulsa sms virtualaccount
3 4 biaya koperasi personal
4 5 familymart personal
5 6 pln
6 7 biaya onus
7 8 koperasi personal
8 9 biaya familymart personal
9 10 fuel
10 11 fuel
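One difference from the two substring approaches is the \b word boundary: the pattern only matches whole words. A small sketch to illustrate, assuming a made-up row "refueling stop" that is not in the question's data. I also wrap each keyword in re.escape, which is my addition and only matters if a keyword contains regex metacharacters.

```python
import re

import pandas as pd

L = ["fuel", "pln", "ayam"]
# \b on both sides means the keyword must appear as a whole word
pat = "|".join(r"\b{}\b".format(re.escape(x)) for x in L)

# "refueling stop" is a hypothetical row added to show the word-boundary effect
s = pd.Series(["fuel travel", "refueling stop", "e-commerce pln", "biaya onus"])

# extract returns NaN where nothing matches; fillna restores the original text
out = s.str.extract("(" + pat + ")", expand=False).fillna(s)
print(out)
```

Here extract keeps the leftmost match in the text, so for a row with several keywords the result depends on word order in the string rather than on the order of L.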
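If you need all matched values, use findall with join, then use loc to overwrite only the rows where the joined result is non-empty, so rows without a match keep their original value: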
print (df)
Id Keyword
0 1 ayam e-commerce
1 2 biaya fuel pln wallet <- matched 2 keywords
2 3 pulsa sms virtualaccount
pat = '|'.join(r"\b{}\b".format(x) for x in L)
s = df['Keyword'].str.findall('('+ pat + ')').str.join(', ')
df.loc[s != '', 'Keyword'] = s
print (df)
Id Keyword
0 1 ayam
1 2 fuel, pln
2 3 pulsa sms virtualaccount
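A runnable version of the findall variant, in case it helps. The three-row DataFrame is rebuilt from the printed sample above; everything else follows the snippet as posted.

```python
import pandas as pd

L = ["fuel", "pln", "ayam"]
pat = "|".join(r"\b{}\b".format(x) for x in L)

# Three-row sample rebuilt from the printed output above
df = pd.DataFrame({
    "Id": [1, 2, 3],
    "Keyword": [
        "ayam e-commerce",
        "biaya fuel pln wallet",
        "pulsa sms virtualaccount",
    ],
})

# findall gives a list of matches per row; join turns it into one string
# (an empty list becomes an empty string)
s = df["Keyword"].str.findall("(" + pat + ")").str.join(", ")

# only rows with at least one match are overwritten
df.loc[s != "", "Keyword"] = s
print(df)
```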