How to get the word case from the text while pattern matching in Python

Question

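I have a data frame with two columns, Stg and Txt. The task is to check every Txt row for all the words in the Stg column and output the matched words to a new column, keeping each word's case as it appears in Txt.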

Example Code:

import re

from pandas import DataFrame

new = {'Stg': ['way','Early','phone','allowed','type','brand name'],
        'Txt': ['An early term','two-way allowed','New Phone feature that allowed','amazing universe','new day','the brand name is stage']
        }

df = DataFrame(new, columns=['Stg', 'Txt'])

my_list = df["Stg"].tolist()

def words_in_string(word_list, a_string):
    # Yield each listed word found in a_string, at most once per word.
    word_set = set(word_list)
    pattern = r'\b({0})\b'.format('|'.join(word_list))
    for found_word in re.finditer(pattern, a_string):
        word = found_word.group(0)
        if word in word_set:
            word_set.discard(word)
            yield word
            if not word_set:
                # Every word has been found; a plain return ends the generator
                # (raise StopIteration becomes a RuntimeError since Python 3.7).
                return

# Collect the matches for each Txt row into a new column.
df['new'] = [list(words_in_string(my_list, text)) for text in df['Txt']]

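The code above returns the words with the case they have in the Stg column. Is there a way to get the case from Txt instead? Also, I want to match whole words only, not substrings: for the text 'two-way', the current code returns the word 'way'.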

Current Output:

    Stg            Txt                                   new
0   way           An early term                           []
1   Early         two-way allowed                         [way, allowed]
2   phone         New Phone feature that allowed          [allowed]
3   allowed       amazing universe                        []
4   type          new day                                 []
5   brand name    the brand name is stage                 [brand name]


Expected Output:

    Stg            Txt                                   new
0   way           An early term                           [early]
1   Early         two-way allowed                         [allowed]
2   phone         New Phone feature that allowed          [Phone, allowed]
3   allowed       amazing universe                        []
4   type          new day                                 []
5   brand name    the brand name is stage                 [brand name]

Answer 1

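You can use Series.str.findall with a negative lookbehind and the re.IGNORECASE flag. findall returns the text exactly as it appears in Txt, so the original case is kept, and the lookbehind rejects a word that directly follows a letter or one of the listed punctuation characters, such as the hyphen in 'two-way':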

import pandas as pd
import re

new = {'Stg': ['way','Early','phone','allowed','type','brand name'],
        'Txt': ['An early term','two-way allowed','New Phone feature that allowed','amazing universe','new day','the brand name is stage']
        }

df = pd.DataFrame(new,columns= ['Stg','Txt'])

# Raw f-string: in a non-raw string "\b" would be a backspace character, not a word boundary.
pattern = "|".join(rf"\w*(?<![A-Za-z\-;:,/|]){i}\b" for i in new["Stg"])

df["new"] = df["Txt"].str.findall(pattern, flags=re.IGNORECASE)

print (df)

#
          Stg                             Txt               new
0         way                   An early term           [early]
1       Early                 two-way allowed         [allowed]
2       phone  New Phone feature that allowed  [Phone, allowed]
3     allowed                amazing universe                []
4        type                         new day                []
5  brand name         the brand name is stage      [brand name]
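
To see why the lookbehind matters, here is a minimal sketch without pandas that compares a plain \b pattern with a lookaround-based one. The helper name find_terms is hypothetical, and rejecting a word character or hyphen on both sides of a term (rather than only before it) is an assumption made for illustration, not part of the answer above.

```python
import re

def find_terms(terms, text):
    """Return the terms found in text as whole words, cased as they appear in text.

    Hypothetical helper: a term is rejected when it touches a word
    character or a hyphen on either side.
    """
    # Escape each term so any regex metacharacters in it are matched literally.
    alternatives = "|".join(re.escape(t) for t in terms)
    pattern = rf"(?<![\w-])(?:{alternatives})(?![\w-])"
    return re.findall(pattern, text, flags=re.IGNORECASE)

terms = ["way", "Early", "phone", "allowed", "type", "brand name"]

# Plain \b treats the hyphen as a boundary, so 'way' matches inside 'two-way'.
print(re.findall(r"\b(?:way|allowed)\b", "two-way allowed"))  # ['way', 'allowed']

# The lookarounds reject 'way' because it is preceded by a hyphen.
print(find_terms(terms, "two-way allowed"))                 # ['allowed']
print(find_terms(terms, "New Phone feature that allowed"))  # ['Phone', 'allowed']
```

Because re.IGNORECASE only affects how the pattern is compared, the strings returned are slices of the input text, which is why 'Phone' comes back capitalized even though the search term is 'phone'.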
