Regex with multiple groups that use lookahead for logical AND inside a regex group

Question

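How can I get a list of booleans that indicates, for each group, whether a match was found, when one of the groups uses positive lookahead for an 'AND' inside it? I want exactly one boolean returned per group.

Example: for the string 'one two three' I want the list [True, True] returned.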

[bool(x) for x in re.findall('(one)|((?=.*three)(?=.*two))', 'one two three')]

gives: [True, True, True]

[bool(x) for x in re.findall('(one)(?=.*three)(?=.*two)', 'one two three')]

gives: [True]

[bool(x) for x in re.findall('(one)|(?=.*three)(?=.*two)', 'one two three')]

gives: [True, False, False]

I want [True, True]

That is, the second and last True should be given whenever 'two' AND 'three' both occur in the string, in any order.
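To see where the three results come from, it can help to print what re.findall actually returns before bool() is applied. This is a minimal sketch, assuming Python 3.7 or later (where an empty match may directly follow a non-empty one); the variable names are only illustrative.

```python
import re

text = 'one two three'

# Two capture groups: findall returns one tuple per match.
two_groups = re.findall('(one)|((?=.*three)(?=.*two))', text)

# One capture group: findall returns that group's string for each match.
one_group = re.findall('(one)|(?=.*three)(?=.*two)', text)

print(two_groups)  # [('one', ''), ('', ''), ('', '')]
print(one_group)   # ['one', '', '']
```

With two capture groups every match becomes a tuple, and a non-empty tuple is truthy even when both strings in it are empty, which is why the first attempt reports True three times. The lookahead alternative also matches the empty string at more than one position, so the number of results does not correspond to the number of groups.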

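Edit for clarification:

In plain language, I want a pattern that returns, for each group in the pattern, True (pattern found) or False (pattern not found). I need a logical AND inside a group, so that the order of the patterns separated by AND within the group does not matter; every pattern in the group just has to be found for the whole group to be marked True.

So, using () to denote groups, with the "pattern" (one) , (three AND two)

For the string 'one two three' I would get [True, True]
For the string 'one three two' I would get [True, True]
For the string 'two three one' I would get [True, True]
For the string 'one three ten' I would get [True, False]
For the string 'ten three two' I would get [False, True]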

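re.findall() or re.finditer() in Python, or pd.Series.str.extractall() in Pandas, return each 'group'. With any of them I can use the regex OR, '|', to separate the groups and get something returned for each 'group': either what it "finds" (the string itself) or, when it does "not find" it, an empty string or NaN, which can then be converted to True or False.

A for loop would work, but my real use case has hundreds of thousands of strings and thousands of search lists, each with 10-20 patterns to loop over for every string. Running these for loops (for each string: for each pattern list: for each pattern) is very slow. I am trying to combine a pattern list into a single pattern and get the same result.

I have this working in Pandas with str.extractall(). I just cannot get a logical AND to work inside a capture 'group'. That is the only thing I am stuck on, and the basis of this question.

The Pandas code looks something like: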
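For reference, the slow nested-loop version described above can be sketched as follows. The function name match_groups and the representation of a pattern list as a list of lists are assumptions made for illustration; the sketch only states the behaviour that a combined pattern is expected to reproduce.

```python
import re

# Hypothetical representation: one inner list per group, patterns in a
# group are combined with logical AND.
groups = [['one'], ['three', 'two']]

def match_groups(text, groups):
    # A group is True only if every one of its patterns is found somewhere
    # in the text, in any order.
    return [all(re.search(p, text) for p in group) for group in groups]

for text in ['one two three', 'one three two', 'two three one',
             'one three ten', 'ten three two']:
    print(text, match_groups(text, groups))
```

Any single-pattern solution can be checked against this function on the five strings listed in the question.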

import pandas as pd
ser = pd.Series(['one two three']) 
(~ser.str.extractall('(one)|(?=.*three)(?=.*two)').isna()).values.tolist()

Returns: [[True], [False], [False]], which can easily be flattened into a list of booleans instead of a list of lists; however, this is the same problem I showed above.

Answer 1

My guess is that you want to design an expression similar to:

[bool(x) for x in re.findall(r'^(?:one\b.*?)\b(two|three)\b|\b(three|two)\b.*$', 'one three two')]

I'm not sure, or maybe:

search = ['two','three']
string_to_search = 'one two three'

output = []
for word in search:
    if word in string_to_search:
        output.append(True)

print(output)

Output

[True, True]
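Note that the loop above only ever appends True, so a missing word shortens the list instead of producing a False. A variant that keeps one entry per word might look like this; the function name words_present is hypothetical, and the test is a plain substring check, so 'one' would also be found inside a longer word.

```python
search = ['two', 'three']

def words_present(text, words):
    # One boolean per word, in the order given; substring test, no regex.
    return [word in text for word in words]

print(words_present('one two three', search))  # [True, True]
print(words_present('one three ten', search))  # [False, True]
```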

Answer 2

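The following line uses re.finditer instead of re.findall. In addition, the regex needs a .+ at the end so that, when both two and three are present in either order, the second alternative consumes the rest of the string.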

[bool(x) for x in re.finditer('(one)|(?=.*two)(?=.*three).+', 'one three two')]

This also works for one three two four, as mentioned in one of the OP's comments, without having to spell out every possible permutation.

[bool(x) for x in re.finditer('(one)|(?=.*two)(?=.*three)(?=.*four).+', 'one two four three')]
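One way to check this approach against the inputs listed in the question is to run it on several of them. Keep in mind that bool() of a match object is always True, so the list has one entry per match found rather than one entry per group. A small sketch (the helper name finditer_flags is hypothetical):

```python
import re

pattern = '(one)|(?=.*two)(?=.*three).+'

def finditer_flags(text):
    # Every match object is truthy, so this is effectively a match count.
    return [bool(m) for m in re.finditer(pattern, text)]

print(finditer_flags('one three two'))  # [True, True]
print(finditer_flags('one three ten'))  # [True]
print(finditer_flags('ten three two'))  # [True]
print(finditer_flags('two three one'))  # [True]
```

For 'two three one' the trailing .+ consumes 'one' before the first alternative gets a chance to match it, so the [True, True] expected in the question is not reproduced. The list alone also does not show which alternative produced a match, which matters for 'one three ten' versus 'ten three two'.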

Answer 3

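We can solve this simply with named capture groups. I just split the pattern into two parts and check whether the first part and the second part are present, returning True for the corresponding part if it is and False otherwise.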

>>> import re
>>> def findstr(x):
...     # One flag per part of the pattern: 'one', and ('three' AND 'two').
...     first = second = False
...     matches = re.finditer(r'(?P<first>one)|(?=.*(?P<second>three))(?=.*two)', x)
...     for match in matches:
...         if match.group('first'):
...             first = True
...         elif match.group('second'):
...             second = True
...     return [first, second]
...
>>> str_lst = ['one two three', 'one three two', 'two three one', 'one three ten', 'ten three two']
>>> for stri in str_lst:
...     print(findstr(stri))
...
[True, True]
[True, True]
[True, True]
[True, False]
[False, True]

Note that the second group is captured only when both two and three are present in the string. See the demo below for an illustration.

DEMO
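The note about the second group can also be checked directly in Python. This is a small sketch; the helper name second_captured is made up for illustration.

```python
import re

pattern = re.compile(r'(?P<first>one)|(?=.*(?P<second>three))(?=.*two)')

def second_captured(text):
    # The named group sits inside the first lookahead, but the alternative
    # only succeeds if the second lookahead (two) succeeds as well.
    return any(m.group('second') for m in pattern.finditer(text))

print(second_captured('ten three two'))  # True: both words are present
print(second_captured('ten three'))      # False: 'two' is missing
print(second_captured('ten two'))        # False: 'three' is missing
```

Because the whole alternative has to match before a match object is produced, a capture made in the first lookahead never leaks out when a later lookahead fails.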

Answer 4

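Avinash Raj's answer got me to the right result. Specifically: in a pattern group whose patterns are joined by the 'AND' regex construct, name only the first pattern and leave all the others unnamed. So I accepted that answer.

Below is a generalized example modeled on my specific use case.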

import pandas as pd
import numpy as np

regex_list = [['one'],['three','two'], ['four'], ['five', 'six', 'seven']]

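def regex_single_make(regex_list):
    # Build one alternation with a named group _n for each inner list.
    tmplist = []
    for n,l in enumerate(regex_list):
        if len(l) == 1:
            # Single pattern: capture it directly in the named group _n.
            tmplist.append(r'(?P<_{}>\b{}\b)'.format(n, l[0]))
        else:
            # Several patterns (logical AND): one lookahead per pattern, so
            # order does not matter; only the first lookahead holds group _n.
            tmplist.append(
                ''.join(
                    [r'(?=.*(?P<_{}>\b{}\b))'.format(n, v)
                    if k == 0 
                    else r'(?=.*\b{}\b)'.format(v)
                    for k,v in enumerate(l)]))
    return '|'.join(tmplist)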

regex_single = regex_single_make(regex_list)

regex_single

'(?P<_0>\bone\b)|(?=.*(?P<_1>\bthree\b))(?=.*\btwo\b)|(?P<_2>\bfour\b)|(?=.*(?P<_3>\bfive\b))(?=.*\bsix\b)(?=.*\bseven\b)'

b = pd.Series([
    'one two three four five six seven', 
    'there is no match in this example text',
    'seven six five four three one twenty ten',
    'except four, no matching strings',
    'no_one, three AND two, no_four, five AND seven AND six'])

match_lists = (np.apply_along_axis(
        lambda vec: vec[[regex_list.index(x) for x in regex_list]], 1, (
        (~b.str.extractall(regex_single).isna())
        .reset_index()
        .groupby('level_0').agg('sum')
        .drop(columns='match')
        .reindex(range(b.size), fill_value=False)
        .values > 0 )
    ).tolist())

match_lists

[[True, True, True, True],
 [False, False, False, False],
 [True, False, True, True],
 [False, False, True, False],
 [False, True, False, True]]
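For comparison, the same named-group idea can be tried with the re module alone, without Pandas. This is only a sketch under assumptions: build_pattern and match_flags are hypothetical helper names, the group names follow the _0, _1, ... convention used above, and no claim is made about how its speed compares with the extractall version.

```python
import re

regex_list = [['one'], ['three', 'two'], ['four'], ['five', 'six', 'seven']]

def build_pattern(groups):
    """Join the groups with '|'; group n is reported through named group _n."""
    parts = []
    for n, words in enumerate(groups):
        if len(words) == 1:
            parts.append(r'(?P<_{}>\b{}\b)'.format(n, words[0]))
        else:
            # Only the first lookahead carries the named group.
            head = r'(?=.*(?P<_{}>\b{}\b))'.format(n, words[0])
            tail = ''.join(r'(?=.*\b{}\b)'.format(w) for w in words[1:])
            parts.append(head + tail)
    return re.compile('|'.join(parts))

def match_flags(pattern, n_groups, text):
    # Fold all matches for one string into a single list of booleans.
    flags = [False] * n_groups
    for m in pattern.finditer(text):
        for name, value in m.groupdict().items():
            if value is not None:
                flags[int(name[1:])] = True  # '_2' -> index 2
    return flags

pattern = build_pattern(regex_list)
texts = [
    'one two three four five six seven',
    'there is no match in this example text',
    'seven six five four three one twenty ten',
    'except four, no matching strings',
    'no_one, three AND two, no_four, five AND seven AND six',
]
match_lists = [match_flags(pattern, len(regex_list), t) for t in texts]
print(match_lists)
```

On these five strings the sketch yields the same lists as the Pandas version above.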
