Extract pattern from a column based on another column's value

Given two columns of a pandas dataframe:

import pandas as pd
df = {'word': ['replay','replayed','playable','thinker','think','thoughtful', 'ex)mple'],
      'root': ['play','play','play','think','think','think', 'ex)mple']}
df = pd.DataFrame(df, columns= ['word','root'])

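I want to extract the substring of column word that includes everything up to the end of the string in the corresponding column root, or NaN if root is not contained in word. That is, the resulting dataframe would look like this: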

word       root    match
replay     play    replay
replayed   play    replay
playable   play    play
thinker    think   think
think      think   think
thoughtful think   NaN
ex)mple    ex)mple ex)mple

My dataframe has several thousand rows, so I would like to avoid for loops if possible.

You can use a regex with str.extract in a groupby+apply:

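import re
# group_keys=False keeps the original row index so the result aligns with df;
# expand=False makes str.extract return a Series instead of a one-column DataFrame
df['match'] = (df.groupby('root', group_keys=False)['word']
                 .apply(lambda g: g.str.extract(f'^(.*{re.escape(g.name)})', expand=False))
               )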
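
As a side note on why re.escape appears in the pattern: the 'ex)mple' row in the question contains a regex metacharacter. The small check below is only an illustrative sketch (the variable names are made up for this example) showing that the unescaped root is rejected as a pattern, while the escaped one is matched literally.

```python
import re

root, word = 'ex)mple', 'ex)mple'

# Unescaped, the ")" is read as closing a group that was never opened,
# so the pattern fails to compile.
try:
    re.match(f'.*{root}', word)
    unescaped_ok = True
except re.error:
    unescaped_ok = False

# Escaped, the root is treated as literal text.
escaped = re.match(f'.*{re.escape(root)}', word)

print(unescaped_ok)     # False
print(escaped.group())  # ex)mple
```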

Alternatively, if you expect few repeated "root" values:

import re
df['match'] = df.apply(lambda r: m.group()
                       if (m:=re.match(f'.*{re.escape(r["root"])}', r['word']))
                       else None, axis=1)

Output:

         word   root   match
0      replay   play  replay
1    replayed   play  replay
2    playable   play    play
3     thinker  think   think
4       think  think   think
5  thoughtful  think     NaN
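
If a regex is not required, a plain string search might also do the job. The sketch below is an alternative, not part of the approaches above: it uses str.rfind to locate the last occurrence of root, which should correspond to what the greedy `.*` pattern captures for words without line breaks. The helper name prefix_through_root is hypothetical, and it still loops in Python (via a list comprehension), so it is not vectorized.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'word': ['replay', 'replayed', 'playable', 'thinker', 'think', 'thoughtful', 'ex)mple'],
    'root': ['play', 'play', 'play', 'think', 'think', 'think', 'ex)mple'],
})

def prefix_through_root(word, root):
    """Return word up to the end of the last occurrence of root, or NaN if absent."""
    pos = word.rfind(root)  # -1 when root does not occur in word
    return word[:pos + len(root)] if pos != -1 else np.nan

# Pair each word with its root row by row; no escaping is needed because
# rfind treats root as literal text.
df['match'] = [prefix_through_root(w, r) for w, r in zip(df['word'], df['root'])]
print(df)
```

Because no pattern is compiled, special characters such as the ")" in 'ex)mple' need no special handling here.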