REGEX: capture each n-letter word between two words of a sentence

Question

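I'm having a hard time trying to select only the words of length n that sit between two words of a sentence. Example, for the sentence: "this is the start some words are to be selected end no more select"

Suppose I want to select the words of 3+ letters between 'start' and 'end': the result would capture some, words, are and selected, and ignore to and be.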

https://regex101.com/r/Ost7Wn/3

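Just selecting [\w]{3,} by itself works, but I can't figure out how to put it between 'start' and 'end' so that my n-letter words are matched only when they appear between them. I tried a lot of things, from lookarounds to capture groups, but I really can't get it to work!

Any ideas? Thanks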

Answer 1

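This is an interesting scenario. Normally, you'd be better off extracting the substring from the main source with start(.*)end and then running your regex on that substring.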
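
A minimal Python sketch of that two-step approach, assuming the anchors are the literal words start and end; the helper name words_between is made up for illustration:

```python
import re

def words_between(text, min_len=3):
    # Step 1: isolate the text between the first "start" and the last "end".
    span = re.search(r"start(.*)end", text, re.I)
    if span is None:
        return []
    # Step 2: collect words of at least min_len characters from that substring.
    return re.findall(r"\b\w{%d,}" % min_len, span.group(1))

print(words_between("this is the start some words are to be selected end no more select"))
# ['some', 'words', 'are', 'selected']
```

Because .* is greedy, this takes everything up to the last end in the string; switch to .*? if the first end should close the span.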

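But that doesn't mean it's impossible with a single RegEx!

I'm sure you've struggled with positive|negative lookahead|lookbehind and found it really awkward that you can't do dynamic-length lookbehinds such as (?<=start.*) the same way you can with lookahead.

The key thing to understand for this example is that RegEx moves the cursor position through the string as it matches... this is the quirk we'll use to make this work.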

RegEx

(?:.*start|^.*|end.*)|\b(\w{3,})(?=.*end)
(?:                                         : Start of non-capture group
   .*start                                  : [Match pattern 1a] matches anything up to and including the {start} anchor
          |                                 : OR operator
           ^.*                              : [Match pattern 1b] matches from the {^} start of the string to the end
              |                             : OR operator
               end.*                        : [Match pattern 1c] matches from the {end} anchor to the end of the string
                    )                       : End of non-capture group
                     |                      : OR operator
                      \b(\w{3,})(?=.*end)   : Captures a boundary [\b] followed by word characters [a-zA-Z0-9_] 3 or more times whilst using a positive lookahead to check that the {end} anchor hasn't been passed

Verbose explanation

The RegEx above can be written, in simple terms, as:

    NON-CAPTURING_GROUP OR CAPTURING_GROUP
OR, more verbose
    (MATCH_PATTERN_1a OR MATCH_PATTERN_1b OR MATCH_PATTERN_1c) OR MATCH_PATTERN_2

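The NON-CAPTURING_GROUP is always evaluated first, so this is where we check whether we actually want to match words at all
- MATCH_PATTERN_1a checks that a start anchor exists and moves the cursor to that point in the string
- MATCH_PATTERN_1b only matches if 1a fails, i.e. when there is no start anchor in the string. In that case it matches everything and the expression stops.
- MATCH_PATTERN_1c checks whether the end anchor has been reached. If it has, it matches through to the end of the string and the expression stops.
The CAPTURING_GROUP is always evaluated second, so it only matches when it should
- MATCH_PATTERN_2 matches any word boundary followed by word characters [a-zA-Z0-9_] of the specified length
  - It also uses a positive lookahead to make sure the end anchor hasn't been passed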
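
To watch the cursor move, here is a small sketch that records every successive match together with its span. It assumes Python's re module; the helper name trace and the sample sentence are only for illustration:

```python
import re

PATTERN = re.compile(r"(?:.*start|^.*|end.*)|\b(\w{3,})(?=.*end)", re.I)

def trace(text):
    """Return (span, kind, matched_text) for every successive match."""
    steps = []
    for m in PATTERN.finditer(text):
        # Group 1 is only set when the word-capturing alternative matched.
        kind = "capture" if m.group(1) else "skip"
        steps.append((m.span(), kind, m.group(0)))
    return steps

for step in trace("one start two three end four"):
    print(step)
# ((0, 9), 'skip', 'one start')
# ((10, 13), 'capture', 'two')
# ((14, 19), 'capture', 'three')
# ((20, 28), 'skip', 'end four')
```

The spans never overlap: once the non-capturing group has consumed the text up to start, or from end onwards, those characters are no longer available to the capturing alternative.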

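Caveat

Note that the first and last captures will always come from the NON-CAPTURE group and should be ignored. Depending on how your RegEx implementation returns results, they may be empty, the full matched string, or both (a multi-dimensional array).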
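
One way to drop those placeholder entries, sketched in Python; the helper name captured_words and its min_len parameter are assumptions for illustration, not part of the pattern itself:

```python
import re

def captured_words(text, min_len=3):
    pattern = r"(?:.*start|^.*|end.*)|\b(\w{%d,})(?=.*end)" % min_len
    # re.findall returns '' for matches where group 1 did not participate,
    # so dropping falsy entries leaves only the real captures.
    return [word for word in re.findall(pattern, text, re.I) if word]

print(captured_words("one two three four START four five two five six END seven"))
# ['four', 'five', 'two', 'five', 'six']
```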

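Example [Python]

Note: Python's re.findall returns only the capture group contents here, i.e. $result = $capture_group[], because the pattern contains exactly one capturing group.
Note: flag = re.I has been set to make the RegEx case-insensitive, i.e. it matches both START and start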

import re

test_str1 = """one two three four START four five two five six END seven"""
test_str2 = """this is the start some words are to be selected end no more select"""
test_str3 = """these are some words that shouldn't be selected end also not selected"""
test_str4 = """end two four five two five six END seven"""
test_str5 = """one start two three end four five six end seven one"""
test_str6 = """END START two four five two five six seven"""

regex1 = r"(?:.*start|^.*|end.*)|\b(\w{3,})(?=.*end)"
regex2 = r"(?:.*start|^.*|end.*)|\b(\w{4,})(?=.*end)"

print(re.findall(regex1, test_str1, re.I))
print(re.findall(regex1, test_str2, re.I))
print(re.findall(regex1, test_str3, re.I))
print(re.findall(regex1, test_str4, re.I))
print(re.findall(regex1, test_str5, re.I))
print(re.findall(regex1, test_str6, re.I))

print(re.findall(regex2, test_str1, re.I))      
print(re.findall(regex2, test_str2, re.I))
print(re.findall(regex2, test_str3, re.I))
print(re.findall(regex2, test_str4, re.I))
print(re.findall(regex2, test_str5, re.I))
print(re.findall(regex2, test_str6, re.I))

'''
  Output:
    ['', 'four', 'five', 'two', 'five', 'six', '']
    ['', 'some', 'words', 'are', 'selected', '']
    ['']
    ['']
    ['', 'two', 'three', '']
    ['']
    ['', 'four', 'five', 'five', '']
    ['', 'some', 'words', 'selected', '']
    ['']
    ['']
    ['', 'three', '']
    ['']
'''

Example [PHP]

Note: PHP outputs in the format $result = [$full_matches[], $capture_group[]]
Note: flag = i has been set to make the RegEx case-insensitive, i.e. it matches both START and start

$test_str1 = "one two three four START four five two five six END seven";
$test_str2 = "this is the start some words are to be selected end no more select";
$test_str3 = "these are some words that shouldn't be selected end also not selected";
$test_str4 = "end two four five two five six END seven";
$test_str5 = "one start two three end four five six end seven one";
$test_str6 = "END START two four five two five six seven";

$regex1 = "/(?:.*start|^.*|end.*)|\b(\w{3,})(?=.*end)/i";
$regex2 = "/(?:.*start|^.*|end.*)|\b(\w{4,})(?=.*end)/i";

preg_match_all($regex1, $test_str1, $matches1);
preg_match_all($regex1, $test_str2, $matches2);
preg_match_all($regex1, $test_str3, $matches3);
preg_match_all($regex1, $test_str4, $matches4);
preg_match_all($regex1, $test_str5, $matches5);
preg_match_all($regex1, $test_str6, $matches6);

preg_match_all($regex2, $test_str1, $matches7);
preg_match_all($regex2, $test_str2, $matches8);
preg_match_all($regex2, $test_str3, $matches9);
preg_match_all($regex2, $test_str4, $matches10);
preg_match_all($regex2, $test_str5, $matches11);
preg_match_all($regex2, $test_str6, $matches12);

echo json_encode($matches1);
echo "\n";
echo json_encode($matches2);
echo "\n";
echo json_encode($matches3);
echo "\n";
echo json_encode($matches4);
echo "\n";
echo json_encode($matches5);
echo "\n";
echo json_encode($matches6);
echo "\n";
echo json_encode($matches7);
echo "\n";
echo json_encode($matches8);
echo "\n";
echo json_encode($matches9);
echo "\n";
echo json_encode($matches10);
echo "\n";
echo json_encode($matches11);
echo "\n";
echo json_encode($matches12);

/*
  Output:
    [["one two three four START","four","five","two","five","six","END seven"],["","four","five","two","five","six",""]]
    [["this is the start","some","words","are","selected","end no more select"],["","some","words","are","selected",""]]
    [["these are some words that shouldn't be selected end also not selected"],[""]]
    [["end two four five two five six END seven"],[""]]
    [["one start","two","three","end four five six end seven one"],["","two","three",""]]
    [["END START"],[""]]
    [["one two three four START","four","five","five","END seven"],["","four","five","five",""]]
    [["this is the start","some","words","selected","end no more select"],["","some","words","selected",""]]
    [["these are some words that shouldn't be selected end also not selected"],[""]]
    [["end two four five two five six END seven"],[""]]
    [["one start","three","end four five six end seven one"],["","three",""]]
    [["END START"],[""]]
*/

.NET

If you're using .NET, the RegEx becomes much simpler:

start(\s*(?!end)\w+\s*)*end

This is because .NET lets you retrieve every occurrence captured by a repeated group.

Other approaches

Realistically, you'd be better off splitting the string into substrings and working from there...

Input

start one two three end start one two three end one two end

Split the string

start(.*?)end

[0] => start one two three end
[1] => start one two three end

Match the words

\b\w{3,}

Example

$string = "start one two three end start four five six end one two end";

preg_match_all('/start(.*?)end/i', $string, $matches);

foreach($matches[1] as $match){
  preg_match_all('/\b\w{3,}/', $match, $out);
  var_dump($out);
}

/*
  Output:
    array(1) {
      [0]=>
      array(3) {
        [0]=>
        string(3) "one"
        [1]=>
        string(3) "two"
        [2]=>
        string(5) "three"
      }
    }
    array(1) {
      [0]=>
      array(3) {
        [0]=>
        string(4) "four"
        [1]=>
        string(4) "five"
        [2]=>
        string(3) "six"
      }
    }
*/

Answer 2

You may use this regex with a lookahead and \G:

(?:\bSTART\b|(?!^)\G)\h+(?!END\b).*?\b(\w{3,})(?=.*?\bEND\b)

RegEx Demo

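RegEx details:

(?:\bSTART\b|(?!^)\G): Match the word START, or continue from the end of the previous match
\G: Asserts position at the end of the previous match, or at the start of the string for the first match
\h+(?!END\b).*?\b(\w{3,}): Match 1+ horizontal whitespace characters not followed by END, then 0 or more characters, then a word of 3+ characters captured in group 1
(?=.*?\bEND\b): Lookahead asserting that END lies ahead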
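
Python's built-in re module has neither \G nor \h, so the pattern cannot be pasted in as-is. A rough sketch of the same walk, assuming \h can be approximated by [ \t] and emulating \G by anchoring each step at the previous match's end with Pattern.match(text, pos); the names START, STEP and words_after_start are made up for illustration:

```python
import re

START = re.compile(r"\bSTART\b")
# Same body as the pattern above, with \h approximated by [ \t].
STEP = re.compile(r"[ \t]+(?!END\b).*?\b(\w{3,})(?=.*?\bEND\b)")

def words_after_start(text):
    start = START.search(text)
    if start is None:
        return []
    words, pos = [], start.end()
    while True:
        # match() only succeeds exactly at pos, which plays the role of \G.
        m = STEP.match(text, pos)
        if m is None:
            return words
        words.append(m.group(1))
        pos = m.end()

print(words_after_start("this is the START some words are to be selected END no more select"))
# ['some', 'words', 'are', 'selected']
```

Unlike a single global match, this only walks from the first START in the string.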

Answer 3

If quantifiers in a lookbehind are supported, you could also use

(?<=\bSTART\s+(?:\w+\s+)*?)\w{3,}(?=(?:\s+\w+)*?\s+END\b)

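Explanation

(?<= Positive lookbehind, assert that what is to the left is
- \bSTART\s+(?:\w+\s+)*? Match START, then optionally repeat word characters followed by whitespace
) Close the lookbehind
\w{3,} Match 3 or more word characters
(?= Positive lookahead, assert that what is to the right is
- (?:\s+\w+)*?\s+END\b Optionally repeat whitespace followed by word characters, then match END
) Close the lookahead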

Regex demo
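
Whether this pattern runs depends on the engine. As a quick check, assuming Python's built-in re module (which, as far as I know, still requires fixed-width lookbehinds), compiling it fails; the helper name supports_pattern is made up for illustration:

```python
import re

PATTERN = r"(?<=\bSTART\s+(?:\w+\s+)*?)\w{3,}(?=(?:\s+\w+)*?\s+END\b)"

def supports_pattern(pattern):
    """Return True if this engine can compile the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        # Raised by re for a lookbehind that is not fixed-width.
        return False

print(supports_pattern(PATTERN))              # False
print(supports_pattern(r"(?<=START )\w{3,}"))  # True
```

The third-party regex package is reported to accept variable-length lookbehinds, but that is not tested here.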
