Python regex: match between variable start and end strings, with a check on the content in between

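I would like to find all strings that occur between elements of the lists start_signs and end_signs. If the element from end_signs is missing, or only shows up later in a different context, there should be no match.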

One solution would be to get all matches between start_signs and end_signs and then check whether each match contains only words from a third list, allowed_words_between.

import re

allowed_words_between = ["and","with","a","very","beautiful"]

start_signs           = ["$","$$"]
end_signs             = ["Ferrari","BMW","Lamborghini","ship"]

teststring = """
             I would like to be a $-millionaire with a Ferrari.                                     -> Match: $-millionaire with a Ferrari
             I would like to be a $$-millionair with a Lamborghini.                                 -> Match: $$-millionair with a Lamborghini
             I would like to be a $$-millionair with a rotten Lamborghini.                          -> No Match because of the word "rotten"
             I would like to be a $$-millionair with a Lamborghini and a Ferrari.                   -> Match: $$-millionair with a Lamborghini and a Ferrari
             I would like to be a $-millionaire with a very, very beautiful ship!                   -> Match: $-millionaire with a very, very beautiful ship
             I would like to be a $-millionaire with a very, very beautiful but a bit dirty ship.                       -> No Match because of the word dirty
             I would like to be a $-millionaire with a dog, a cat, two children and a cowboy hat. That would be great.   -> No Match
             """

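Another solution would be to take the string beginning at one of the start_signs and cut it off as soon as a string appears that is not in the allowed list: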

allowed_list = allowed_words_between + start_signs + end_signs

What I have tried so far:

I used the solution from this post:
regexString = "("+"|".join(start_signs) + ")" + ".*?" + "(" +"|".join(end_signs)+")" 

I tried to build the regex string from the variable start and end lists, but that does not work. I also do not know how the content check could be done.

matches          = re.findall(regexString,teststring)
substituted_text = re.sub(regexString, "[[Found It]]", teststring, count=0)

You could repeat any of the allowed_words_between, each preceded by whitespace and optionally followed by a comma, until you reach one of the end_signs.

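You can turn the capturing groups into non-capturing groups (?: ... ), because otherwise re.findall returns the values of the capturing groups instead of the whole match.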
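To illustrate the difference, here is a minimal sketch. The sample text and the shortened patterns are made up for this illustration only and are not the full pattern from this answer.

```python
import re

# Hypothetical sample text, used only for this illustration.
text = "a $-millionaire with a Ferrari"

# With capturing groups, re.findall returns a tuple of the group values per match.
capturing = re.findall(r"(\$|\$\$)\S* with a (Ferrari|BMW)", text)

# With non-capturing groups, re.findall returns the whole matched text.
non_capturing = re.findall(r"(?:\$|\$\$)\S* with a (?:Ferrari|BMW)", text)

print(capturing)      # [('$', 'Ferrari')]
print(non_capturing)  # ['$-millionaire with a Ferrari']
```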

Note that the $ has to be escaped as \$ to match it literally.
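If you would rather keep the list entries unescaped, one option is to let re.escape do the escaping when the alternation is built. This is a small sketch of that idea; the name start_pattern is introduced here only for illustration.

```python
import re

start_signs = ["$", "$$"]

# Unescaped, "$" is an end-of-string anchor, so it does not match the dollar character.
unescaped = re.findall("$", "a $-millionaire")

# re.escape turns each sign into a literal; joining them gives an alternation.
start_pattern = "|".join(re.escape(sign) for sign in start_signs)
escaped = re.findall(start_pattern, "a $-millionaire")

print(unescaped)      # ['']
print(start_pattern)  # \$|\$\$
print(escaped)        # ['$']
```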

The pattern would look like

(?:\$|\$\$)\S*(?:(?:\s+(?:and|with|a|very|beautiful),?)*\s+(?:Ferrari|BMW|Lamborghini|ship))+

The pattern matches

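  • (?:\$|\$\$)\S* Match any of the start_signs followed by optional non-whitespace characters (\S can also match a dollar sign, but you can make it more specific, for example -\w+)
  • (?: Outer non-capturing group
    • (?: Inner non-capturing group
      • \s+(?:and|with|a|very|beautiful),? Match 1+ whitespace characters and any of the allowed_words_between, followed by an optional comma
    • )*\s+ Close the inner non-capturing group and repeat it 0+ times, followed by 1+ whitespace characters
    • (?:Ferrari|BMW|Lamborghini|ship) Match any of the end_signs
  • )+ Close the outer non-capturing group and repeat it 1+ times, so that the string with "Lamborghini and a Ferrari" is matched as well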
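The same breakdown can be written directly into the pattern with re.VERBOSE, which ignores whitespace in the pattern and allows comments. This is only a sketch of the pattern above with the word lists hard-coded; the variable names pattern, ok and bad are made up for the illustration.

```python
import re

pattern = re.compile(r"""
    (?:\$|\$\$)\S*                                  # a start sign plus the rest of that token
    (?:                                             # outer group: one "... end sign" chunk
        (?:\s+(?:and|with|a|very|beautiful),?)*     # allowed words, each with an optional comma
        \s+(?:Ferrari|BMW|Lamborghini|ship)         # one of the end signs
    )+                                              # one or more chunks
""", re.VERBOSE)

ok = pattern.findall("a $$-millionair with a Lamborghini and a Ferrari.")
bad = pattern.findall("a $$-millionair with a rotten Lamborghini.")

print(ok)   # ['$$-millionair with a Lamborghini and a Ferrari']
print(bad)  # []
```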


import re

allowed_words_between = ["and", "with", "a", "very", "beautiful"]
start_signs = [r"\$", r"\$\$"]  # escaped so that $ is matched literally
end_signs = ["Ferrari", "BMW", "Lamborghini", "ship"]
teststring = """
             I would like to be a $-millionaire with a Ferrari.
             I would like to be a $$-millionair with a Lamborghini.
             I would like to be a $$-millionair with a rotten Lamborghini.
             I would like to be a $$-millionair with a Lamborghini and a Ferrari.
             I would like to be a $-millionaire with a very, very beautiful ship!
             I would like to be a $-millionaire with a very, very beautiful but a bit dirty ship.
             I would like to be a $-millionaire with a dog, a cat, two children and a cowboy hat. That would be great.
             """
# Raw string pieces keep \S and \s intact as regex escapes
regexString = r"(?:" + "|".join(start_signs) + r")\S*(?:(?:\s+(?:" + "|".join(allowed_words_between) + r"),?)*\s+(?:" + "|".join(end_signs) + r"))+"

for s in re.findall(regexString, teststring):
    print(s)

Output

$-millionaire with a Ferrari
$$-millionair with a Lamborghini
$$-millionair with a Lamborghini and a Ferrari
$-millionaire with a very, very beautiful ship
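For comparison, the two-step idea from the question (first collect candidates between a start sign and an end sign, then check the words in between) could be sketched roughly as follows. The names candidate_re and find_valid are hypothetical, the added \b word boundaries are an assumption that was not part of the question, and the lazy match stops at the first end sign, so "Lamborghini and a Ferrari" is only matched up to "Lamborghini".

```python
import re

allowed_words_between = {"and", "with", "a", "very", "beautiful"}
start_signs = ["$", "$$"]
end_signs = ["Ferrari", "BMW", "Lamborghini", "ship"]

# Longest start sign first, every sign escaped so it is matched literally.
start = "|".join(re.escape(s) for s in sorted(start_signs, key=len, reverse=True))
end = "|".join(re.escape(e) for e in end_signs)

# Step 1: start sign + rest of that token, a lazy middle part, the first end sign.
candidate_re = re.compile(rf"(?:{start})\S*(.*?)\b(?:{end})\b")

def find_valid(line):
    """Return the candidates whose middle part holds only allowed words."""
    results = []
    for m in candidate_re.finditer(line):
        # Step 2: split the middle part into words and check each one.
        words = re.findall(r"\w+", m.group(1))
        if all(w in allowed_words_between for w in words):
            results.append(m.group(0))
    return results

print(find_valid("I would like to be a $-millionaire with a Ferrari."))
print(find_valid("I would like to be a $$-millionair with a rotten Lamborghini."))
```

The single-pattern version above is more compact; this variant may be easier to extend if the content check grows beyond a plain word list.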