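Unable to find the first occurrence of substring using regex for set of values in pandas
I have a DataFrame like the one below, and for each string I need to find only the first occurrence of any value from a given set.
I am not able to use the "find" function with a regular expression and a dictionary (list) of keywords. If I use "findall", it of course returns every occurrence, which is not what I need.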
Text
51000/1-PLASTIC 150 Prange
51034/2-RUBBER KL 100 AA
51556/3-PAPER BD+CM 1 BOXT2
52345/1-FLOW IJ 10place 500 plastic
54975/1-DIVIDER PQR 100 BC
54975/1-SCALE DEF 555 AB Apple
54975/1-PLASTIC ABC 4.6 BB plastic
Code:
import re
L = ['PLASTIC','RUBBER','PAPER','FLOW']
pat = '|'.join(r"\b{}\b".format(x) for x in L)
df['Result'] = df['Text'].str.find(pat, flags=re.I).str.join(' ')
print(df)
df = df.replace(r'^\s*$', np.nan, regex=True)
df = df.replace(np.nan, "Not known", regex=True)
#df['Result'] = df['Result'].str.lower()
Expected result:
Text Result
51000/1-PLASTIC 150 Prange Plastic
51034/2-RUBBER KL 100 AA Rubber
51556/3-PAPER BD+CM 1 BOXT2 Paper
52345/1-FLOW IJ 10place 500 plastic Flow
54975/1-DIVIDER PQR 100 BC Not known
54975/1-SCALE DEF 555 AB Apple Not known
54975/1-PLASTIC ABC 4.6 BB plastic Plastic
Error:
TypeError: find() got an unexpected keyword argument 'flags'
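For context, the error arises because Series.str.find searches for a literal substring and returns its position (or -1); it accepts only sub, start and end, so it has no flags parameter and does not interpret regular expressions. The short sketch below illustrates this on two of the sample rows; the variable names are illustrative only.

```python
import re

import pandas as pd

s = pd.Series(["51000/1-PLASTIC 150 Prange", "54975/1-DIVIDER PQR 100 BC"])

# str.find treats its argument as a plain substring and returns the
# index of the first occurrence, or -1 when the substring is absent.
positions = s.str.find("PLASTIC")
print(positions.tolist())

# Passing a regex flag is rejected because find() has no such parameter.
try:
    s.str.find("PLASTIC", flags=re.I)
    flags_rejected = False
except TypeError as exc:
    flags_rejected = True
    print("TypeError:", exc)
```

Since find returns integer positions rather than the matched text, a regex-aware method such as findall or extract is the better fit here.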
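Use Series.str.findall instead of find, and select the first value of each returned list by indexing with str[0]. Alternatively, Series.str.extract returns the first match directly:
import re
L = ['PLASTIC','RUBBER','PAPER','FLOW']
pat = '|'.join(r"\b{}\b".format(x) for x in L)
# Option 1: findall returns a list of all matches per row; str[0] keeps the first (NaN if the list is empty)
df['Result'] = df['Text'].str.findall(pat, flags=re.I).str[0]
# Option 2: extract returns only the first match of the capture group (NaN if there is none)
df['Result'] = df['Text'].str.extract('(' + pat + ')', flags=re.I, expand=False)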
Then convert the missing values to "Not known":
df['Result'] = df['Result'].fillna("Not known")
Finally, if necessary, use Series.str.capitalize:
df['Result'] = df['Result'].str.capitalize()
print(df)
Text Result
0 51000/1-PLASTIC 150 Prange Plastic
1 51034/2-RUBBER KL 100 AA Rubber
2 51556/3-PAPER BD+CM 1 BOXT2 Paper
3 52345/1-FLOW IJ 10place 500 plastic Flow
4 54975/1-DIVIDER PQR 100 BC Not known
5 54975/1-SCALE DEF 555 AB Apple Not known
6 54975/1-PLASTIC ABC 4.6 BB plastic Plastic
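Putting the steps together, here is a minimal self-contained sketch that rebuilds the sample DataFrame from the question and applies the extract variant. The use of re.escape and expand=False is an addition not present in the original answer: re.escape guards against keywords that contain regex metacharacters, and expand=False makes extract return a Series rather than a one-column DataFrame.

```python
import re

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({"Text": [
    "51000/1-PLASTIC 150 Prange",
    "51034/2-RUBBER KL 100 AA",
    "51556/3-PAPER BD+CM 1 BOXT2",
    "52345/1-FLOW IJ 10place 500 plastic",
    "54975/1-DIVIDER PQR 100 BC",
    "54975/1-SCALE DEF 555 AB Apple",
    "54975/1-PLASTIC ABC 4.6 BB plastic",
]})

L = ['PLASTIC', 'RUBBER', 'PAPER', 'FLOW']

# Whole-word alternation; re.escape is an added safeguard (assumption)
# in case a keyword contains characters such as '+' or '.'.
pat = '|'.join(r"\b{}\b".format(re.escape(x)) for x in L)

# extract scans left to right, so the capture group holds the first
# keyword that appears in each string; rows without a match become NaN.
first = df['Text'].str.extract('(' + pat + ')', flags=re.I, expand=False)

# Fill the gaps, then normalise the capitalisation.
df['Result'] = first.fillna("Not known").str.capitalize()
print(df)
```

The regex engine reports the leftmost match in the string, which is why the fourth row yields Flow even though "plastic" also appears later in the text.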