Wildcard in regular expression that is greedy only until a stop word

I am trying to build a 'simple' regular expression (in Java) that matches sentences like these:

I want to cook something
I want to cook something with chicken and cheese
I want to cook something with chicken but without onions
I want to cook something without onions but with chicken and cheese
I want to cook something with candy but without nuts within 30 minutes

Ideally, it should also match: I want to cook something with candy and without nuts within 30 minutes

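In these examples I want to capture the 'included' ingredients, the 'excluded' ingredients and the maximum 'duration' of the cooking procedure. As you can see, each of these 3 capture groups is optional in the pattern, each one starts with a specific word (with, (but)?without, within), and each group should use a wildcard that matches until the next specific keyword is found. In addition, the ingredients can consist of several words, so in the second/third example "chicken and cheese" should be matched by the named capture group 'included'.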

Ideally, I would like to write a pattern similar to this one:

I want to cook something ((with (?<include>.+))|((but )?without (?<exclude>.+))|(within (?<duration>.+) minutes))*

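Obviously this does not work, because the wildcards can also match the keywords: once the first keyword has matched, everything that follows (including any further keywords) is consumed by the greedy wildcard of the corresponding named capture group.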
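To make the problem concrete, here is a minimal sketch that runs the pattern above unchanged against one of the example sentences. The class name GreedyCaptureDemo and the helper include are only illustrative names, not part of the original question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreedyCaptureDemo {
    // The pattern from the question, unchanged apart from Java string quoting.
    static final Pattern GREEDY = Pattern.compile(
        "I want to cook something ((with (?<include>.+))"
        + "|((but )?without (?<exclude>.+))"
        + "|(within (?<duration>.+) minutes))*");

    // Returns what the 'include' group captured, or null if the sentence does not match.
    static String include(String sentence) {
        Matcher m = GREEDY.matcher(sentence);
        return m.matches() ? m.group("include") : null;
    }

    public static void main(String[] args) {
        // The greedy .+ runs to the end of the input, so the "but without ..."
        // part ends up inside 'include' and 'exclude' never participates.
        System.out.println(include("I want to cook something with chicken but without onions"));
    }
}
```

Because the literal prefix ends with a space, this pattern also rejects the bare sentence "I want to cook something".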

I tried using a lookahead, for example something like this:

something ((with (?<IncludedIngredients>.*(?=but)))|(but )?without (?<ExcludedIngredients>.+))+

This regex recognizes something with chicken but without onions, but it does not match something with chicken
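The reason appears to be that the lookahead (?=but) makes a following "but" mandatory: without it, the with branch can never succeed. A small sketch of that behaviour, with LookaheadDemo and included as purely illustrative names:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookaheadDemo {
    // The lookahead attempt from the question, unchanged apart from Java string quoting.
    static final Pattern P = Pattern.compile(
        "something ((with (?<IncludedIngredients>.*(?=but)))"
        + "|(but )?without (?<ExcludedIngredients>.+))+");

    // Returns the included ingredients, or null if the pattern is not found at all.
    static String included(String sentence) {
        Matcher m = P.matcher(sentence);
        return m.find() ? m.group("IncludedIngredients") : null;
    }

    public static void main(String[] args) {
        // Found: "but" follows the ingredients, so the lookahead succeeds.
        // Note the trailing space left in the capture.
        System.out.println("[" + included("something with chicken but without onions") + "]");
        // Not found: there is no "but" anywhere after "with ", so this prints null.
        System.out.println(included("something with chicken"));
    }
}
```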

Is there a simple way to do this in a regex?

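P.S. A 'simple' solution means one where I do not have to spell out every possible combination of these keywords in a sentence and order them by the number of keywords used in each combination.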

It probably boils down to the construct below.

(?m)^I[ ]want[ ]to[ ]cook[ ]something(?=[ ]|$)(?<Order>(?:(?<with>\b(?:but[ ])?with[ ](?:(?!(?:\b(?:but[ ])?with(?:in|out)?\b)).)*)|(?<without>\b(?:but[ ])?without[ ](?:(?!(?:\b(?:but[ ])?with(?:in|out)?\b)).)*)|(?<time>\bwithin[ ](?<duration>.+)[ ]minutes[ ]?)|(?<unknown>(?:(?!(?:\b(?:but[ ])?with(?:in|out)?\b)).)+))*)$

https://regex101.com/r/RHfGnb/1

Expanded

 (?m)
 ^ I [ ] want [ ] to [ ] cook [ ] something
 (?= [ ] | $ )
 (?<Order>                      # (1 start)
      (?:
           (?<with>                      # (2 start)
                \b
                (?: but [ ] )?
                with [ ]
                (?:
                     (?!
                          (?:
                               \b
                               (?: but [ ] )?
                               with
                               (?: in | out )?
                               \b
                          )
                     )
                     .
                )*
           )                             # (2 end)
        |  (?<without>                   # (3 start)
                \b
                (?: but [ ] )?
                without [ ]
                (?:
                     (?!
                          (?:
                               \b
                               (?: but [ ] )?
                               with
                               (?: in | out )?
                               \b
                          )
                     )
                     .
                )*
           )                             # (3 end)
        |  (?<time>                      # (4 start)
                \b within [ ]
                (?<duration> .+ )             # (5)
                [ ] minutes [ ]? 
           )                             # (4 end)
        |  (?<unknown>                   # (6 start)
                (?:
                     (?!
                          (?:
                               \b
                               (?: but [ ] )?
                               with
                               (?: in | out )?
                               \b
                          )
                     )
                     .
                )+
           )                             # (6 end)
      )*
 )                             # (1 end)
 $
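
Since the question is about Java, here is a rough sketch of how the construct above might be used there. The regex is the same as the one-line version; only the repeated keyword lookahead is pulled into a constant for readability. The names TemperedDotDemo, STOP and extract are my own illustrative choices. Two assumptions to be aware of: the with and without groups capture the keyword together with the ingredients (plus a possible trailing space, hence the trim), and a group inside the repeated alternation keeps the capture from the last iteration in which it took part.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TemperedDotDemo {
    // Negative lookahead that stops the dot in front of any keyword
    // (with, without, within, optionally preceded by "but ").
    static final String STOP = "(?!(?:\\b(?:but[ ])?with(?:in|out)?\\b))";

    static final Pattern ORDER = Pattern.compile(
        "(?m)^I[ ]want[ ]to[ ]cook[ ]something(?=[ ]|$)"
        + "(?<Order>(?:"
        + "(?<with>\\b(?:but[ ])?with[ ](?:" + STOP + ".)*)"
        + "|(?<without>\\b(?:but[ ])?without[ ](?:" + STOP + ".)*)"
        + "|(?<time>\\bwithin[ ](?<duration>.+)[ ]minutes[ ]?)"
        + "|(?<unknown>(?:" + STOP + ".)+)"
        + ")*)$");

    // Collects the groups that took part in the match, trimmed of surrounding spaces.
    static Map<String, String> extract(String sentence) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = ORDER.matcher(sentence);
        if (!m.find()) {
            return result; // empty map: the sentence does not fit the pattern
        }
        for (String name : new String[] {"with", "without", "duration"}) {
            String value = m.group(name);
            if (value != null) {
                result.put(name, value.trim());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(extract(
            "I want to cook something with candy but without nuts within 30 minutes"));
    }
}
```

To get the bare ingredient lists, the keyword prefix would still have to be stripped from the with and without captures, or additional inner groups added to the pattern.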

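Your pattern is fine. Once you have identified the greedy nature of the quantifiers as the problem, just consider changing them to reluctant ones, i.e. replace .+ with .+?: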

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String[] examples = {
    "I want to cook something",
    "I want to cook something with chicken and cheese",
    "I want to cook something with chicken but without onions",
    "I want to cook something without onions but with chicken and cheese",
    "I want to cook something with candy but without nuts within 30 minutes" };

Pattern p = Pattern.compile("I want to cook something"
    + "((( but)? with (?<include>.+?))|(( but)? without (?<exclude>.+?))"
        + "|( within (?<duration>.+?) minutes))*");

for(String s: examples) {
    Matcher m = p.matcher(s);
    if(m.matches()) {
        System.out.println(s);
        if(m.start("include") >= 0) System.out.println("\tinclude: "+m.group("include"));
        if(m.start("exclude") >= 0) System.out.println("\texclude: "+m.group("exclude"));
        if(m.start("duration") >= 0) System.out.println("\tduration: "+m.group("duration"));
    }
}

Output:

I want to cook something
I want to cook something with chicken and cheese
    include: chicken and cheese
I want to cook something with chicken but without onions
    include: chicken
    exclude: onions
I want to cook something without onions but with chicken and cheese
    include: chicken and cheese
    exclude: onions
I want to cook something with candy but without nuts within 30 minutes
    include: candy
    exclude: nuts
    duration: 30

The only other changes needed were adding an optional but in front of with, to allow without … but with, and moving the spaces so that "I want to cook something" matches without a trailing space.
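
One caveat worth spelling out: the reluctant quantifiers only expand as far as needed for the overall match to succeed, so this approach depends on matches() forcing the pattern to cover the whole input. A short sketch comparing matches() with find() on the same pattern; ReluctantAnchoringDemo and its two helpers are illustrative names only:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReluctantAnchoringDemo {
    // Same pattern as in the answer above.
    static final Pattern P = Pattern.compile("I want to cook something"
        + "((( but)? with (?<include>.+?))|(( but)? without (?<exclude>.+?))"
        + "|( within (?<duration>.+?) minutes))*");

    // matches() must consume the entire input, so .+? is forced to grow.
    static String includeWithMatches(String sentence) {
        Matcher m = P.matcher(sentence);
        return m.matches() ? m.group("include") : null;
    }

    // find() is satisfied with the shortest successful match, so .+? stays minimal.
    static String includeWithFind(String sentence) {
        Matcher m = P.matcher(sentence);
        return m.find() ? m.group("include") : null;
    }

    public static void main(String[] args) {
        String s = "I want to cook something with chicken and cheese";
        System.out.println(includeWithMatches(s)); // the full ingredient list
        System.out.println(includeWithFind(s));    // only the first character
    }
}
```

If find() has to be used, for instance to locate the sentence inside a longer text, appending an anchor such as $ to the pattern should have a similar effect.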