Get list of Booleans corresponding to list of multiple-and-duplicate regex patterns

Question

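Given a list of regex patterns, which includes duplicate patterns, and an input string.

How can I get a list of Booleans (the same length as the input list of regex patterns), where each Boolean indicates whether the regex pattern at the same index in the list matches the input string?

I want to do this for every string in a pandas Series or DataFrame column.

The code below does almost exactly what I want, but it does not match the second (or nth) occurrence of a duplicated regex pattern, only the first.

I would like to avoid solutions that use a for loop.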

import pandas as pd

a = pd.Series([
    'one two three four five six seven', 
    'seven six five four three two one twenty ten'])

# list of regex patterns (note: 'one' is duplicated)
pattern_list = ['three', 'one', 'no_match', 'not_in', 'five', 'one']
pattern_single = '(' + ')|('.join(pattern_list) + ')'

pattern_single

'(three)|(one)|(no_match)|(not_in)|(five)|(one)'

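# True where a capture group took part in at least one match, per input string
# ('any' keeps the result Boolean; 'sum' would yield match counts instead)
((~a.str.extractall(pattern_single).isna())
    .reset_index()
    .groupby('level_0').agg('any')
    .drop(columns='match')
    .values.tolist())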

[[True, True, False, False, True, False],
 [True, True, False, False, True, False]]

whereas what I want is:

[[True, True, False, False, True, True],
 [True, True, False, False, True, True]]

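I tried wrapping each <pattern>, separated by |, in ((?<![\w\d])<pattern>(?![\w\d])); the result was the same.

I tried wrapping each <pattern> in ((?=.*<pattern>)), with and without the | separator; this does not capture anything.

I also tried naming each group as follows, with the same result.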

pattern_list = ['<{}>{}'.format(chr(k+97), v) for k,v in enumerate(pattern_list)]
pattern_single = '(?P' + ')*|(?P'.join(pattern_list) + ')'

pattern_single

'(?P<a>three)*|(?P<b>one)*|(?P<c>no_match)*|(?P<d>not_in)*|(?P<e>five)*|(?P<f>one)'

Answer 1

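As Patrick Artner commented, this can't be done with a regex pattern: in an alternation the first alternative that matches wins, so a later group holding a duplicate pattern never captures. Here is a solution that at least gives the answer I was after.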
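To make the reason concrete, here is a minimal sketch using the standard `re` module and the pattern list from the question. It records which capture groups ever take part in a match; the variable name `groups_hit` is only illustrative.

```python
import re

pattern_list = ['three', 'one', 'no_match', 'not_in', 'five', 'one']
pattern_single = '(' + ')|('.join(pattern_list) + ')'

text = 'one two three four five six seven'

# For every match, collect the (0-based) indices of the groups that captured.
groups_hit = set()
for m in re.finditer(pattern_single, text):
    groups_hit.update(i for i, g in enumerate(m.groups()) if g is not None)

# Group 5 (the second 'one') never appears: group 1 always matches first.
print(sorted(groups_hit))  # → [0, 1, 4]
```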

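I changed the input slightly so that the output differs between strings with different matches and strings with no matches at all.

If anyone else has a solution that times faster, I will accept it as the answer.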

import pandas as pd
import numpy as np

b = pd.Series([
    'one two three four five six seven', 
    'there is no match in this example text',
    'seven six five four three one twenty ten',
    'also no matching strings'])

pattern_list = ['three', 'one', 'no_match', 'not_in', 'five', 'one', 'two']
pattern_single = '(' + ')|('.join(pattern_list) + ')'

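# One row per input string, one column per capture group; True where the
# group took part in at least one match. Strings with no match at all are
# missing from the extractall result, so reindex adds them back as all-False.
first_match = (
    (~b.str.extractall(pattern_single).isna())
    .reset_index()
    .groupby('level_0').agg('any')
    .drop(columns='match')
    .reindex(range(b.size), fill_value=False)
    .values)

# For each pattern, the index of its first occurrence in pattern_list,
# so duplicates point back to the group that can actually capture.
first_ptrn_index = [pattern_list.index(x) for x in pattern_list]

# Reorder/duplicate the columns of every row according to that mapping.
indx_mtch = lambda vec: vec[first_ptrn_index]

np.apply_along_axis(indx_mtch, 1, first_match).tolist()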

[[True, True, False, False, True, True, True],
 [False, False, False, False, False, False, False],
 [True, True, False, False, True, True, False],
 [False, False, False, False, False, False, False]]
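The same idea can probably be written a little more compactly. The sketch below is not the code above but a variation on it: it deduplicates the patterns before building the combined regex and replaces `np.apply_along_axis` with plain NumPy column indexing. The function name `match_flags` is hypothetical, and no timing has been measured. Like the original, it assumes the patterns contain no capture groups of their own.

```python
import pandas as pd


def match_flags(series, patterns):
    """Return one list of Booleans per string, aligned with `patterns`."""
    # Deduplicate while preserving order (dict keys keep insertion order).
    unique = list(dict.fromkeys(patterns))
    combined = '(' + ')|('.join(unique) + ')'

    # One row per string, one column per unique pattern.
    hits = (
        series.str.extractall(combined)
        .notna()
        .groupby(level=0).any()
        .reindex(series.index, fill_value=False)
        .to_numpy()
    )

    # Map every original pattern (duplicates included) to its unique column.
    column_of = {p: i for i, p in enumerate(unique)}
    columns = [column_of[p] for p in patterns]
    return hits[:, columns].tolist()


b = pd.Series([
    'one two three four five six seven',
    'there is no match in this example text',
    'seven six five four three one twenty ten',
    'also no matching strings'])

result = match_flags(
    b, ['three', 'one', 'no_match', 'not_in', 'five', 'one', 'two'])
print(result)
```

For the series `b` and the seven patterns used here, this should give the same nested list as the output shown above.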
