Extract pattern from a column based on another column's value
Given two columns of a pandas DataFrame:
import pandas as pd
df = {'word': ['replay', 'replayed', 'playable', 'thinker', 'think', 'thoughtful', 'ex)mple'],
      'root': ['play', 'play', 'play', 'think', 'think', 'think', 'ex)mple']}
df = pd.DataFrame(df, columns=['word', 'root'])
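I'd like to extract the substring of column word that includes everything up to the end of the string in the corresponding column root, or NaN if the string in root is not contained in word. That is, the resulting dataframe would look as follows: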
word        root     match
replay      play     replay
replayed    play     replay
playable    play     play
thinker     think    think
think       think    think
thoughtful  think    NaN
ex)mple     ex)mple  ex)mple
My dataframe has several thousand rows, so I'd like to avoid for-loops where possible.
You can use a regex with str.extract in a groupby+apply:
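import re
# One pattern per root; re.escape protects regex metacharacters such as ")".
# group_keys=False and expand=False return a Series on the original index,
# so the assignment aligns with df on recent pandas versions as well.
df['match'] = (df.groupby('root', group_keys=False)['word']
                 .apply(lambda g: g.str.extract(f'^(.*{re.escape(g.name)})', expand=False))
               )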
Or, if you expect few repeated "root" values:
import re
# Row-wise alternative; the := operator requires Python 3.8+.
df['match'] = df.apply(lambda r: m.group()
                       if (m := re.match(f'.*{re.escape(r["root"])}', r['word']))
                       else None, axis=1)
Output:
         word     root    match
0      replay     play   replay
1    replayed     play   replay
2    playable     play     play
3     thinker    think    think
4       think    think    think
5  thoughtful    think      NaN
6     ex)mple  ex)mple  ex)mple
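To see what the pattern does on each row, here is a minimal pandas-free sketch of the same matching logic. The helper name match_up_to_root is hypothetical (it is not part of pandas or re), and the lists simply mirror the word and root columns from the question. Because `.*` is greedy, the match runs up to the last occurrence of the root. Without re.escape, a root such as 'ex)mple' would be rejected as an invalid pattern.

```python
import re

def match_up_to_root(word, root):
    """Return the part of `word` up to and including `root`, or None if absent."""
    # re.escape makes the root match literally, so ")" is not read as regex syntax.
    m = re.match(f".*{re.escape(root)}", word)
    return m.group() if m else None

# Same data as the 'word' and 'root' columns in the question.
words = ['replay', 'replayed', 'playable', 'thinker', 'think', 'thoughtful', 'ex)mple']
roots = ['play', 'play', 'play', 'think', 'think', 'think', 'ex)mple']

matches = [match_up_to_root(w, r) for w, r in zip(words, roots)]
print(matches)
```

Assuming the lists are in the same order as the dataframe rows, the result could be assigned back with df['match'] = matches. Missing values would then be None rather than NaN.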