REGEX: capture each n-letter word between two words of a sentence

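I'm having a hard time trying to select only the words of length n that sit between two words of a sentence. Example, for the statement: "this is the start some words are to be selected end no more select"

Let's say I want to select the words of 3+ letters between the words 'start' and 'end'. The result would capture 'some', 'words', 'are' and 'selected', ignoring 'to' and 'be'.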

https://regex101.com/r/Ost7Wn/3

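Just selecting [\w]{3,} on its own works, but I can't figure out how to confine it between 'start' and 'end' so that my n-letter words are matched only when they appear between those two words. I tried many things, from lookarounds to capture groups, but I really can't get it to work!

Any ideas? Thanks.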

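This is an interesting scenario. Ordinarily, you would be better off extracting the substring start(.*)end from the main source and then running your regex on that substring.

But that doesn't mean it's impossible with a single RegEx!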

I'm sure you have struggled with positive|negative lookahead|lookbehind and found it really annoying that most engines don't let you do a variable-length lookbehind such as (?<=start.*).

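The key thing to understand for this example is that a RegEx moves the cursor position forward as it matches the string... that is the quirk we are going to exploit to make this work.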
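To make the cursor behaviour concrete, here is a minimal Python sketch on a made-up string. The text and the pattern `skip \w+|(\w+)` are illustrative assumptions, not part of the solution; the point is only that whatever the first alternative consumes is no longer available to the second one.

```python
import re

text = "skip this keep that"

# Branch 1 consumes "skip this" in one go and captures nothing.
# Branch 2 captures any word the cursor reaches afterwards in group 1.
pattern = re.compile(r"skip \w+|(\w+)")

matches = list(pattern.finditer(text))

# Where each match started and ended: the cursor only ever moves forward.
spans = [m.span() for m in matches]

# Words captured by branch 2; matches produced by branch 1 have group 1 unset.
kept = [m.group(1) for m in matches if m.group(1) is not None]

print(spans)  # [(0, 9), (10, 14), (15, 19)]
print(kept)   # ['keep', 'that']
```

The word "this" is never offered to the second branch because the first branch already moved the cursor past it.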

RegEx

(?:.*start|^.*|end.*)|\b(\w{3,})(?=.*end)
(?:                                         : Start of non-capture group
   .*start                                  : [Match pattern 1a] matches anything up to and including the {start} anchor
          |                                 : OR operator
           ^.*                              : [Match pattern 1b] matches from the {^} start of the string to the end
              |                             : OR operator
               end.*                        : [Match pattern 1c] matches from the {end} anchor to the end of the string
                    )                       : End of non-capture group
                     |                      : OR operator
                      \b(\w{3,})(?=.*end)   : Matches a word boundary [\b], then captures 3 or more word characters [a-zA-Z0-9_], using a positive lookahead to check that the {end} anchor hasn't been passed

Verbose explanation

Put simply, the RegEx above can be written as:

    NON-CAPTURING_GROUP OR CAPTURING_GROUP
OR, more verbose
    (MATCH_PATTERN_1a OR MATCH_PATTERN_1b OR MATCH_PATTERN_1c) OR MATCH_PATTERN_2
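  • NON-CAPTURING_GROUP is always evaluated first, so this is where we check whether we actually want to match
    • MATCH_PATTERN_1a checks whether the start anchor exists and moves the cursor to that point in the string
    • MATCH_PATTERN_1b only matches if 1a failed, i.e. there is no start anchor in the string. If so, it matches everything and the expression stops.
    • MATCH_PATTERN_1c checks whether the end anchor has been reached. If it has, it matches to the end of the string and the expression stops.
  • CAPTURING_GROUP always comes second, so it only matches when it should
    • MATCH_PATTERN_2 matches any word boundary followed by word characters [a-zA-Z0-9_] within the specified length range
      • It also uses a positive lookahead to make sure the end anchor has not been passed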
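The evaluation order described above can be traced with a small Python sketch. It assumes the same four alternatives, each wrapped in a named group so that `lastgroup` reports which branch produced a match; the labels p1a, p1b, p1c and p2 and the helper `trace` are hypothetical names added only for this illustration.

```python
import re

# Same alternatives as the RegEx above, in the same order.
# The group names are labels for tracing, not part of the original pattern.
labelled = re.compile(
    r"(?P<p1a>.*start)"               # 1a: everything up to and including start
    r"|(?P<p1b>^.*)"                  # 1b: whole string, only tried at position 0
    r"|(?P<p1c>end.*)"                # 1c: from end to the end of the string
    r"|\b(?P<p2>\w{3,})(?=.*end)",    # 2: a 3+ letter word with end still ahead
    re.I,
)

def trace(text):
    """Return (branch label, matched text) for every match, in order."""
    return [(m.lastgroup, m.group(0)) for m in labelled.finditer(text)]

for branch, matched in trace(
    "this is the start some words are to be selected end no more select"
):
    print(branch, repr(matched))
# p1a 'this is the start'
# p2 'some'
# p2 'words'
# p2 'are'
# p2 'selected'
# p1c 'end no more select'
```

A string without a start anchor is swallowed whole by the 1b branch, so `trace("no anchors here")` yields a single `p1b` entry.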

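Caveat

Note that the first match, and the last one once an end anchor has been reached, always come from the NON-CAPTURE group and should be ignored. Depending on how the RegEx implementation reports results, such a match may show up as an empty capture, as the full matched string, or as both (a multi-dimensional array).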
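In Python, one way to drop those placeholder entries is to filter out the empty captures. This is only a sketch: `words_between` is a hypothetical helper name, and it assumes the pattern and the re.I flag from this answer.

```python
import re

# The single-RegEx pattern from this answer, compiled case-insensitively.
PATTERN = re.compile(r"(?:.*start|^.*|end.*)|\b(\w{3,})(?=.*end)", re.I)

def words_between(text):
    """Return only the captured words, dropping the empty placeholder entries."""
    # With one capturing group, findall returns that group's value per match;
    # matches made by the non-capturing branches contribute ''.
    return [word for word in PATTERN.findall(text) if word]

print(words_between("one two three four START four five two five six END seven"))
# ['four', 'five', 'two', 'five', 'six']
```

Filtering on truthiness is safe here because the capturing group requires at least three characters, so a real capture is never empty.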

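Example [Python]

Note: Python's re.findall outputs in the format $result = $capture_group[], with an empty string for every match in which the capture group did not take part.
Note: the flag re.I is set to make the RegEx case-insensitive, i.e. it matches both START and start.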

import re

test_str1 = """one two three four START four five two five six END seven"""
test_str2 = """this is the start some words are to be selected end no more select"""
test_str3 = """these are some words that shouldn't be selected end also not selected"""
test_str4 = """end two four five two five six END seven"""
test_str5 = """one start two three end four five six end seven one"""
test_str6 = """END START two four five two five six seven"""

regex1 = r"(?:.*start|^.*|end.*)|\b(\w{3,})(?=.*end)"
regex2 = r"(?:.*start|^.*|end.*)|\b(\w{4,})(?=.*end)"

print(re.findall(regex1, test_str1, re.I))
print(re.findall(regex1, test_str2, re.I))
print(re.findall(regex1, test_str3, re.I))
print(re.findall(regex1, test_str4, re.I))
print(re.findall(regex1, test_str5, re.I))
print(re.findall(regex1, test_str6, re.I))

print(re.findall(regex2, test_str1, re.I))      
print(re.findall(regex2, test_str2, re.I))
print(re.findall(regex2, test_str3, re.I))
print(re.findall(regex2, test_str4, re.I))
print(re.findall(regex2, test_str5, re.I))
print(re.findall(regex2, test_str6, re.I))

'''
  Output:
    ['', 'four', 'five', 'two', 'five', 'six', '']
    ['', 'some', 'words', 'are', 'selected', '']
    ['']
    ['']
    ['', 'two', 'three', '']
    ['']
    ['', 'four', 'five', 'five', '']
    ['', 'some', 'words', 'selected', '']
    ['']
    ['']
    ['', 'three', '']
    ['']
'''

Example [PHP]

Note: PHP outputs in the format $result = [$full_matches[], $capture_group[]].
Note: the flag i is set to make the RegEx case-insensitive, i.e. it matches both START and start.

$test_str1 = "one two three four START four five two five six END seven";
$test_str2 = "this is the start some words are to be selected end no more select";
$test_str3 = "these are some words that shouldn't be selected end also not selected";
$test_str4 = "end two four five two five six END seven";
$test_str5 = "one start two three end four five six end seven one";
$test_str6 = "END START two four five two five six seven";

$regex1 = "/(?:.*start|^.*|end.*)|\b(\w{3,})(?=.*end)/i";
$regex2 = "/(?:.*start|^.*|end.*)|\b(\w{4,})(?=.*end)/i";

preg_match_all($regex1, $test_str1, $matches1);
preg_match_all($regex1, $test_str2, $matches2);
preg_match_all($regex1, $test_str3, $matches3);
preg_match_all($regex1, $test_str4, $matches4);
preg_match_all($regex1, $test_str5, $matches5);
preg_match_all($regex1, $test_str6, $matches6);

preg_match_all($regex2, $test_str1, $matches7);
preg_match_all($regex2, $test_str2, $matches8);
preg_match_all($regex2, $test_str3, $matches9);
preg_match_all($regex2, $test_str4, $matches10);
preg_match_all($regex2, $test_str5, $matches11);
preg_match_all($regex2, $test_str6, $matches12);

echo json_encode($matches1);
echo "\n";
echo json_encode($matches2);
echo "\n";
echo json_encode($matches3);
echo "\n";
echo json_encode($matches4);
echo "\n";
echo json_encode($matches5);
echo "\n";
echo json_encode($matches6);
echo "\n";
echo json_encode($matches7);
echo "\n";
echo json_encode($matches8);
echo "\n";
echo json_encode($matches9);
echo "\n";
echo json_encode($matches10);
echo "\n";
echo json_encode($matches11);
echo "\n";
echo json_encode($matches12);

/*
  Output:
    [["one two three four START","four","five","two","five","six","END seven"],["","four","five","two","five","six",""]]
    [["this is the start","some","words","are","selected","end no more select"],["","some","words","are","selected",""]]
    [["these are some words that shouldn't be selected end also not selected"],[""]]
    [["end two four five two five six END seven"],[""]]
    [["one start","two","three","end four five six end seven one"],["","two","three",""]]
    [["END START"],[""]]
    [["one two three four START","four","five","five","END seven"],["","four","five","five",""]]
    [["this is the start","some","words","selected","end no more select"],["","some","words","selected",""]]
    [["these are some words that shouldn't be selected end also not selected"],[""]]
    [["end two four five two five six END seven"],[""]]
    [["one start","three","end four five six end seven one"],["","three",""]]
    [["END START"],[""]]
*/

.NET

If you are using .NET, the RegEx becomes much simpler:

start(\s*(?!end)\w+\s*)*end

This is because .NET keeps every capture made by a repeated group, not just the last one.

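Alternative approach

Realistically, you are better off splitting the string into substrings and evaluating from there...

Input

start one two three end start one two three end one two end

Split the string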

start(.*?)end

[0] => start one two three end
[1] => start one two three end

Match the words

\b\w{3,}

Example

$string = "start one two three end start four five six end one two end";

preg_match_all('/start(.*?)end/i', $string, $matches);

foreach($matches[1] as $match){
  preg_match_all('/\b\w{3,}/', $match, $out);
  var_dump($out);
}

/*
  Output:
    array(1) {
      [0]=>
      array(3) {
        [0]=>
        string(3) "one"
        [1]=>
        string(3) "two"
        [2]=>
        string(5) "three"
      }
    }
    array(1) {
      [0]=>
      array(3) {
        [0]=>
        string(4) "four"
        [1]=>
        string(4) "five"
        [2]=>
        string(3) "six"
      }
    }
*/
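For comparison with the Python example earlier in this answer, the same two-step approach might look like this in Python. It is a sketch under the same assumptions as the PHP code (lazy start(.*?)end, case-insensitive); `words_in_sections` and its `min_len` parameter are hypothetical names.

```python
import re

def words_in_sections(text, min_len=3):
    """Split out every start...end section, then list its words of min_len+ characters."""
    # Step 1: lazily grab the text between each start/end pair.
    sections = re.findall(r"start(.*?)end", text, re.I)
    # Step 2: run the simple word pattern on each section separately.
    word_re = re.compile(r"\b\w{%d,}" % min_len)
    return [word_re.findall(section) for section in sections]

string = "start one two three end start four five six end one two end"
print(words_in_sections(string))
# [['one', 'two', 'three'], ['four', 'five', 'six']]
```

Keeping the two patterns separate means the word pattern can change (for example a different minimum length) without touching the logic that finds the sections.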

You can use this regex, which combines \G with a lookahead:

(?:\bSTART\b|(?!^)\G)\h+(?!END\b).*?\b(\w{3,})(?=.*?\bEND\b)

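Regex details:

  • (?:\bSTART\b|(?!^)\G): matches the word START, or continues from the end of the previous match
  • \G: asserts the position at the end of the previous match, or at the start of the string for the first match; (?!^) rules out the latter
  • \h+(?!END\b).*?\b(\w{3,}): matches 1+ horizontal whitespace characters not followed by the word END, then 0 or more characters (lazily), then a word of 3+ characters that is captured in group 1
  • (?=.*?\bEND\b): lookahead asserting that the word END is present further ahead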

If quantifiers are supported inside a lookbehind (as they are in .NET, for example), you can also use

(?<=\bSTART\s+(?:\w+\s+)*?)\w{3,}(?=(?:\s+\w+)*?\s+END\b)

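Explanation

  • (?<= positive lookbehind, assert that what is to the left is
    • \bSTART\s+(?:\w+\s+)*? match START, optionally followed by repeated words and whitespace characters
  • ) close the lookbehind
  • \w{3,} match 3 or more word characters
  • (?= positive lookahead, assert that what is to the right is
    • (?:\s+\w+)*?\s+END\b optionally repeat whitespace and word characters, then match END
  • ) close the lookahead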
