Extract pattern from a column based on another column's value

Question

Given two columns of a pandas DataFrame:

import pandas as pd
df = {'word': ['replay','replayed','playable','thinker','think','thoughtful', 'ex)mple'],
      'root': ['play','play','play','think','think','think', 'ex)mple']}
df = pd.DataFrame(df, columns= ['word','root'])

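I want to extract the substring of column word that includes everything up to and including the end of the string in the corresponding column root, or NaN if root is not contained in word. That is, the resulting DataFrame would look like this: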

word       root    match
replay     play    replay
replayed   play    replay
playable   play    play
thinker    think   think
think      think   think
thoughtful think   NaN
ex)mple    ex)mple ex)mple

My DataFrame has several thousand rows, so I would like to avoid for-loops if possible.

Answer 1

You can use a regex with str.extract inside a groupby + apply:

import re
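# group_keys=False keeps the original row index so the result aligns with df;
# expand=False makes str.extract return a Series instead of a one-column DataFrame.
df['match'] = (df.groupby('root', group_keys=False)['word']
                 .apply(lambda g: g.str.extract(f'^(.*{re.escape(g.name)})', expand=False))
               )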
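
A side note on why re.escape matters here: the root ex)mple from the question contains a regex metacharacter, so interpolating it into the pattern unescaped gives an invalid regex. A minimal sketch of the difference (the variable names are only illustrative, they are not part of the question):

```python
import re

root = 'ex)mple'

# Unescaped, the stray ')' leaves the pattern with unbalanced parentheses.
try:
    re.compile(f'^(.*{root})')
    unescaped_ok = True
except re.error:
    unescaped_ok = False

# Escaped, the root is treated as literal text.
pattern = re.compile(f'^(.*{re.escape(root)})')
escaped_match = pattern.match('ex)mple').group(1)

print(unescaped_ok)   # False
print(escaped_match)  # ex)mple
```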

Alternatively, if you expect few repeated root values (this uses the := operator, so it needs Python 3.8+):

import re
df['match'] = df.apply(lambda r: m.group()
                       if (m:=re.match(f'.*{re.escape(r["root"])}', r['word']))
                       else None, axis=1)

Output:

         word     root    match
0      replay     play   replay
1    replayed     play   replay
2    playable     play     play
3     thinker    think    think
4       think    think    think
5  thoughtful    think      NaN
6     ex)mple  ex)mple  ex)mple
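
Because .* is greedy, the pattern above keeps everything through the last occurrence of root in word. If that is the behaviour you want, the same result can probably be had without regex at all, using str.rfind. This is only a sketch, not part of the answer above: prefix_through_root is a made-up helper name, and it assumes the words contain no newlines (where . would stop matching). It returns None for a missing root, which pandas treats as a missing value.

```python
words = ['replay', 'replayed', 'playable', 'thinker', 'think', 'thoughtful', 'ex)mple']
roots = ['play', 'play', 'play', 'think', 'think', 'think', 'ex)mple']

def prefix_through_root(word, root):
    """Return word up to and including the last occurrence of root, else None."""
    i = word.rfind(root)  # -1 when root does not occur in word
    return word[:i + len(root)] if i != -1 else None

# Pair each word with its own root; no regex escaping is needed.
matches = [prefix_through_root(w, r) for w, r in zip(words, roots)]
print(matches)
```

With the DataFrame from the question, the same comprehension over zip(df['word'], df['root']) can be assigned to df['match']. It is still a Python-level loop, but it avoids the per-row overhead of df.apply(axis=1).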

Tags: python, extract, pandas