Most efficient way of checking if a string matches a pattern in Python?

I have a string format that can be changed by someone else, say:

sample = f"This is a {pet} it has {number} legs"

I currently have two strings:

a = "This is a dog it has 4 legs"
b = "This was a dog"

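How can I check which string satisfies this sample format? I could use Python's string replace() on sample to build a regular expression from it and check with re.match. The catch is that sample can change, so a static replace will not always work, because sample may gain more placeholders.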

Try this.

sample = "This is a {pet} it has {number} legs"

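def check(string):
    patt = sample.split(' ')
    words = string.split(' ')
    # Positions in the template that hold a placeholder such as {pet}
    index = [i for i, v in enumerate(patt) if '{' in v and '}' in v]
    # The word counts must agree, otherwise patt[i] can raise IndexError
    # or a shorter string can match only a prefix of the template
    if len(words) == len(patt) and all(
            v == patt[i] or i in index for i, v in enumerate(words)):
        print('string matches the pattern')
    else:
        print('string does not match the pattern')

a = "This is a dog it has 4 legs"
b = "This was a dog"
check(a) # string matches the pattern
check(b) # string does not match the pattern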

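First of all, if you want to match against a template string, don't use the f'' string prefix, or it will be evaluated immediately. Instead, just write the format string like this: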

sample = 'This is a {pet} it has {number} legs'

Here is a function I wrote for a project that parses a format string and converts it into a regular expression:

import re
import string


def format_to_re(format_str, **kwargs):
    r"""
    Convert a format string to a regular expression, such that any format
    fields may be replaced with regular expression syntax, and any literals are
    properly escaped.

    As a special case, if a 2-tuple is given for the value of a field, the
    first time the field appears in the format string the first element of the
    tuple is used as the replacement, and the second element is used for all
    subsequent replacements.

    Examples
    --------

    This example uses a backslash just to add a little Windows flavor:

    >>> filename_format = \
    ...     r'scenario_{scenario}\{name}_{scenario}_{replicate}.npz'
    >>> filename_re = format_to_re(filename_format,
    ...     scenario=(r'(?P<scenario>0*\d+)', r'0*\d+'),
    ...     replicate=r'0*\d+', name=r'\w+')
    >>> filename_re
    'scenario_(?P<scenario>0*\d+)\\\w+_0*\d+_0*\d+\.npz'
    >>> import re
    >>> filename_re = re.compile(filename_re)
    >>> filename_re
    re.compile(...)

    This regular expression can be used to match arbitrary filenames to
    determine whether or not they are in the format specified by the original
    ``filename_format`` template, as well as to extract the values of fields by
    using groups:

    >>> match = filename_re.match(r'scenario_000\my_model_000_000.npz')
    >>> match is not None
    True
    >>> match.group('scenario')
    '000'
    >>> filename_re.match(r'scenario_000\my_model_garbage.npz') is None
    True
    """

    formatter = string.Formatter()
    new_format = []
    seen_fields = set()

    for item in formatter.parse(format_str):
        literal, field_name, spec, converter = item
        new_format.append(re.escape(literal))

        if field_name is None:
            continue

        replacement = kwargs[field_name]

        if isinstance(replacement, tuple) and len(replacement) == 2:
            if field_name in seen_fields:
                replacement = replacement[1]
            else:
                replacement = replacement[0]

        new_format.append(replacement)
        seen_fields.add(field_name)

    return ''.join(new_format)

You can use it on your example like this:

>>> sample_re = format_to_re(sample, pet=r'(?P<pet>.+)', number=r'(?P<number>\d+)')
>>> sample_re = re.compile(sample_re)
>>> sample_re
re.compile('This\ is\ a\ (?P<pet>.+)\ it\ has\ (?P<number>\d+)\ legs')
>>> m = sample_re.match('This is a dog it has 4 legs')
>>> m.groupdict()
{'pet': 'dog', 'number': '4'}

Depending on your use case you could simplify it a bit. The original version was written to handle some application-specific cases.

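Another possible improvement would be, given an arbitrary format string, to supply a default regular expression for each field found in it, possibly determined by any format specifier on the field.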
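A rough sketch of that idea, not part of the original function: the name format_to_default_re, the DEFAULT_PATTERNS table and the fallback pattern are all hypothetical choices. It assumes every field is a plain named field such as {pet} or {number:d} (no positional {} fields, attribute access or indexing), and it treats the last character of the format spec as the type character.

```python
import re
import string

# Hypothetical defaults keyed by the format-spec type character.
DEFAULT_PATTERNS = {
    'd': r'[+-]?\d+',
    'f': r'[+-]?\d+(?:\.\d+)?',
}
FALLBACK_PATTERN = r'.+?'  # assumption: anything non-empty, matched lazily


def format_to_default_re(format_str):
    """Compile format_str into a regex, picking a default pattern per field."""
    parts = []
    seen = set()
    for literal, field_name, spec, _conv in string.Formatter().parse(format_str):
        parts.append(re.escape(literal))
        if field_name is None:
            continue
        # Assumption: the type character, if present, ends the spec.
        type_char = spec[-1] if spec else ''
        pattern = DEFAULT_PATTERNS.get(type_char, FALLBACK_PATTERN)
        if field_name in seen:
            # A repeated field must equal its first occurrence.
            parts.append(f'(?P={field_name})')
        else:
            parts.append(f'(?P<{field_name}>{pattern})')
            seen.add(field_name)
    return re.compile(''.join(parts))


rx = format_to_default_re('This is a {pet} it has {number:d} legs')
m = rx.fullmatch('This is a dog it has 4 legs')
print(m.groupdict())                              # {'pet': 'dog', 'number': '4'}
print(rx.fullmatch('This was a dog') is None)     # True
```

Using fullmatch rather than match keeps trailing text from slipping through; whether that is wanted depends on the use case.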

When you run:

sample = f"This is a {pet} it has {number} legs"

sample does not contain any placeholders.

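sample is the string "This is a xxx it has yyy legs", where xxx and yyy have already been substituted. So unless you know which parts were the parameters, there is nothing you can do.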
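To make the difference concrete, here is a small sketch. The values given to pet and number and the helper field_names are made up for illustration; string.Formatter().parse is used only to list the replacement fields that remain in each string.

```python
import string

pet, number = "dog", 4  # illustrative values

evaluated = f"This is a {pet} it has {number} legs"  # substituted right away
template = "This is a {pet} it has {number} legs"    # placeholders are kept


def field_names(s):
    """Return the replacement-field names still present in s."""
    return [name for _, name, _, _ in string.Formatter().parse(s)
            if name is not None]


print(evaluated)               # This is a dog it has 4 legs
print(field_names(evaluated))  # []
print(field_names(template))   # ['pet', 'number']
```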

If you want placeholders, don't use an f-string:

sample = "This is a {pet} it has {number} legs"
formatted_string = sample.format(**{'pet': 'dog', 'number': '4'})
# "This is a dog it has 4 legs"

Then you can run something like this:

import re
import string
from operator import itemgetter

sample = "This is a {pet} it has {number} legs"

keys = {k: r'\w+' for k in filter(None,
        map(itemgetter(1), string.Formatter().parse(sample)))}
# {'pet': '\w+', 'number': '\w+'}

regex = re.compile(sample.format(**keys))


a = "This is a dog it has 4 legs"
b = "This was a dog"
regex.match(a)
# <re.Match object; span=(0, 27), match='This is a dog it has 4 legs'>

regex.match(b)
# None
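One caveat with building the regex through sample.format(**keys): the literal text of the template goes into the pattern unescaped, and match() only anchors at the start. A possible variant, sketched here with a hypothetical helper name template_to_regex and a made-up template containing punctuation, escapes the literals with re.escape and checks with fullmatch:

```python
import re
import string


def template_to_regex(template, field_pattern=r'\w+'):
    """Compile a str.format-style template into a regex (illustrative helper)."""
    parts = []
    for literal, field_name, _spec, _conv in string.Formatter().parse(template):
        parts.append(re.escape(literal))          # literal text is matched verbatim
        if field_name is not None:
            parts.append(f'(?:{field_pattern})')  # every placeholder uses field_pattern
    return re.compile(''.join(parts))


regex = template_to_regex("Is this a {pet}? It has {number} legs.")
print(bool(regex.fullmatch("Is this a dog? It has 4 legs.")))         # True
print(bool(regex.fullmatch("Is this a dog? It has 4 legs. Really")))  # False
```

Without the escaping, the "?" and "." in such a template would be read as regex operators rather than literal characters.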

A simple little way to extract the values is:

import re

patt = re.compile(r'This is a (.+) it has (\d+) legs')

a = "This is a dog it has 4 legs"
b = "This was a dog"
match = patt.search(a)
print(match.group(1), match.group(2))
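Note that search() returns None when nothing matches, so calling group() on the result for a string like b would raise AttributeError. A minimal guarded version, where the function name extract is just an illustrative choice, might look like this:

```python
import re

patt = re.compile(r'This is a (.+) it has (\d+) legs')


def extract(text):
    """Return (pet, number) as strings, or None if text does not fit."""
    match = patt.fullmatch(text)
    return match.groups() if match else None


print(extract("This is a dog it has 4 legs"))  # ('dog', '4')
print(extract("This was a dog"))               # None
```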

I like these approaches, but I found a two-line solution. (I don't know how it performs, but it works!)


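import re

def pattern_match(text, pattern):
    # Each {placeholder} becomes a capture group; other regex metacharacters
    # in pattern are not escaped here.
    regex = re.sub(r'{[^{]*}', '(.*)', "^" + pattern + "$")
    if re.match(regex, text):
        print(f"'{text}' matches the pattern '{pattern}'")

pattern_match(a, sample)
pattern_match(b, sample)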

Output:

'This is a dog it has 4 legs' matches the pattern 'This is a {pet} it has {number} legs'
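If the template might contain characters that are special in a regex (such as "+" or "?"), one possible tweak is to escape the whole template first and then replace the escaped placeholders. This is only a sketch of that variation; it returns a boolean instead of printing, and the second template below is invented for illustration.

```python
import re


def pattern_match(text, pattern):
    """Return True if text fits pattern, treating only {name} as wildcards."""
    # re.escape turns "{pet}" into "\{pet\}", so the substitution
    # looks for that escaped form.
    regex = re.sub(r'\\\{[^{}]*\\\}', '(.*)', re.escape(pattern))
    return re.fullmatch(regex, text) is not None


sample = "This is a {pet} it has {number} legs"
print(pattern_match("This is a dog it has 4 legs", sample))  # True
print(pattern_match("This was a dog", sample))               # False
print(pattern_match("2+2 = 4?", "2+2 = {n}?"))               # True
```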