Regex: capture each n-letter word between two words of a sentence
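I'm having a hard time trying to select only the words of a given length that sit between two words of a sentence.
Example:
For the sentence:
"this is the start some words are to be selected end no more select"
Say I want to select the words of 3+ letters between 'start' and 'end'. The result should capture
some words are selected, ignoring to and be.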
https://regex101.com/r/Ost7Wn/3
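Just selecting [\w]{3,} by itself works, but I can't figure out how to confine it between 'start' and 'end' so that my n-letter words match only when they occur between them. I've tried a lot of things, from lookarounds to capture groups, but I really can't get it to work!
Any ideas? Thanks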
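This is an interesting scenario. Usually you'd be better off extracting the substring from the main source with start(.*)end and then running your regex on that substring.
But that doesn't mean it's impossible with a single RegEx!
I'm sure you've been struggling with positive|negative lookahead|lookbehind and found it really frustrating that you can't use a variable-length lookbehind such as (?<=start.*) the way you can with a lookahead.
The key thing to understand for this example is that the RegEx engine moves a cursor position through the string as it matches... that is the quirk we're going to exploit to make this work.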
RegEx
(?:.*start|^.*|end.*)|\b(\w{3,})(?=.*end)
(?: : Start of non-capture group
.*start : [Match pattern 1a] matches anything up to and including the {start} anchor
| : OR operator
^.* : [Match pattern 1b] matches from the {^} start of the string to the end
| : OR operator
end.* : [Match pattern 1c] matches from the {end} anchor to the end of the string
) : End of non-capture group
| : OR operator
\b(\w{3,})(?=.*end) : Captures a boundary [\b] followed by word characters [a-zA-Z0-9_] 3 or more times whilst using a positive lookahead to check that the {end} anchor hasn't been passed
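Wordy explanation
In simple terms, the RegEx above can be written as:
NON-CAPTURING_GROUP OR CAPTURING_GROUP
Or, more verbosely:
(MATCH_PATTERN_1a OR MATCH_PATTERN_1b OR MATCH_PATTERN_1c) OR MATCH_PATTERN_2
NON-CAPTURING_GROUP is always evaluated first, so this is where we check whether we actually want to match.
MATCH_PATTERN_1a checks that the start anchor is present and moves the cursor to that point in the string.
MATCH_PATTERN_1b only matches if 1a fails, i.e. the start anchor is not present in the string. In that case it matches everything and the expression stops.
MATCH_PATTERN_1c checks whether the end anchor has been reached. If it has, it matches through to the end of the string and the expression stops.
CAPTURING_GROUP always comes second, so it only matches when it is supposed to.
MATCH_PATTERN_2 matches any word boundary followed by word characters [a-zA-Z0-9_] of the specified length.
- It also checks the positive lookahead to make sure the end anchor hasn't been passed.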
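To watch the cursor movement described above, one option is to give each alternative its own named group and report which one fired. This is only a sketch: the group names (pre, nostart, post, word) and the helper trace are made up for illustration and are not part of the original pattern.

```python
import re

# Same alternatives as the RegEx above, but each branch gets a hypothetical
# name so we can see which one consumed each part of the string.
LABELLED = re.compile(
    r"(?P<pre>.*start)"               # pattern 1a: everything up to and including start
    r"|(?P<nostart>^.*)"              # pattern 1b: whole string when there is no start
    r"|(?P<post>end.*)"               # pattern 1c: end and everything after it
    r"|\b(?P<word>\w{3,})(?=.*end)",  # pattern 2: the words we actually want
    re.I,
)

def trace(text):
    """Return (branch name, matched text) pairs in the order they were matched."""
    return [(m.lastgroup, m.group()) for m in LABELLED.finditer(text)]

for branch, chunk in trace("one start two to three end four"):
    print(f"{branch:8} {chunk!r}")
```

The first entry comes from 1a, the last from 1c, and only the entries labelled word are the ones of interest.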
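Caveat
Note that the first and last captures always come from the NON-CAPTURE group and should be ignored. Depending on the regex implementation, they may be empty strings, the full matched string, or both (a multi-dimensional array).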
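In Python those placeholder entries come back from re.findall as empty strings, so dropping them is straightforward. A minimal sketch, assuming a helper named words_between (a name chosen here, not taken from the original answer):

```python
import re

PATTERN = re.compile(r"(?:.*start|^.*|end.*)|\b(\w{3,})(?=.*end)", re.I)

def words_between(text):
    """Return the captured words, dropping the empty entries that the
    non-capturing alternatives leave behind in re.findall's result."""
    return [word for word in PATTERN.findall(text) if word]

print(words_between("this is the start some words are to be selected end no more select"))
```

For a string without a start anchor the helper simply returns an empty list.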
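Example [Python]
Note: with a single capture group, Python's re.findall returns a flat list of that group's contents, so the matches made by the NON-CAPTURE group show up as empty strings.
Note: flag = re.I is set to make the RegEx case-insensitive, i.e. it matches both START and start.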
import re
test_str1 = """one two three four START four five two five six END seven"""
test_str2 = """this is the start some words are to be selected end no more select"""
test_str3 = """these are some words that shouldn't be selected end also not selected"""
test_str4 = """end two four five two five six END seven"""
test_str5 = """one start two three end four five six end seven one"""
test_str6 = """END START two four five two five six seven"""
regex1 = r"(?:.*start|^.*|end.*)|\b(\w{3,})(?=.*end)"
regex2 = r"(?:.*start|^.*|end.*)|\b(\w{4,})(?=.*end)"
print(re.findall(regex1, test_str1, re.I))
print(re.findall(regex1, test_str2, re.I))
print(re.findall(regex1, test_str3, re.I))
print(re.findall(regex1, test_str4, re.I))
print(re.findall(regex1, test_str5, re.I))
print(re.findall(regex1, test_str6, re.I))
print(re.findall(regex2, test_str1, re.I))
print(re.findall(regex2, test_str2, re.I))
print(re.findall(regex2, test_str3, re.I))
print(re.findall(regex2, test_str4, re.I))
print(re.findall(regex2, test_str5, re.I))
print(re.findall(regex2, test_str6, re.I))
'''
Output:
['', 'four', 'five', 'two', 'five', 'six', '']
['', 'some', 'words', 'are', 'selected', '']
['']
['']
['', 'two', 'three', '']
['']
['', 'four', 'five', 'five', '']
['', 'some', 'words', 'selected', '']
['']
['']
['', 'three', '']
['']
'''
Example [PHP]
Note: PHP outputs in the format $result = [$full_matches[], $capture_group[]]
Note: flag = i is set to make the RegEx case-insensitive, i.e. it matches both START and start.
$test_str1 = "one two three four START four five two five six END seven";
$test_str2 = "this is the start some words are to be selected end no more select";
$test_str3 = "these are some words that shouldn't be selected end also not selected";
$test_str4 = "end two four five two five six END seven";
$test_str5 = "one start two three end four five six end seven one";
$test_str6 = "END START two four five two five six seven";
$regex1 = "/(?:.*start|^.*|end.*)|\b(\w{3,})(?=.*end)/i";
$regex2 = "/(?:.*start|^.*|end.*)|\b(\w{4,})(?=.*end)/i";
preg_match_all($regex1, $test_str1, $matches1);
preg_match_all($regex1, $test_str2, $matches2);
preg_match_all($regex1, $test_str3, $matches3);
preg_match_all($regex1, $test_str4, $matches4);
preg_match_all($regex1, $test_str5, $matches5);
preg_match_all($regex1, $test_str6, $matches6);
preg_match_all($regex2, $test_str1, $matches7);
preg_match_all($regex2, $test_str2, $matches8);
preg_match_all($regex2, $test_str3, $matches9);
preg_match_all($regex2, $test_str4, $matches10);
preg_match_all($regex2, $test_str5, $matches11);
preg_match_all($regex2, $test_str6, $matches12);
echo json_encode($matches1);
echo "\n";
echo json_encode($matches2);
echo "\n";
echo json_encode($matches3);
echo "\n";
echo json_encode($matches4);
echo "\n";
echo json_encode($matches5);
echo "\n";
echo json_encode($matches6);
echo "\n";
echo json_encode($matches7);
echo "\n";
echo json_encode($matches8);
echo "\n";
echo json_encode($matches9);
echo "\n";
echo json_encode($matches10);
echo "\n";
echo json_encode($matches11);
echo "\n";
echo json_encode($matches12);
/*
Output:
[["one two three four START","four","five","two","five","six","END seven"],["","four","five","two","five","six",""]]
[["this is the start","some","words","are","selected","end no more select"],["","some","words","are","selected",""]]
[["these are some words that shouldn't be selected end also not selected"],[""]]
[["end two four five two five six END seven"],[""]]
[["one start","two","three","end four five six end seven one"],["","two","three",""]]
[["END START"],[""]]
[["one two three four START","four","five","five","END seven"],["","four","five","five",""]]
[["this is the start","some","words","selected","end no more select"],["","some","words","selected",""]]
[["these are some words that shouldn't be selected end also not selected"],[""]]
[["end two four five two five six END seven"],[""]]
[["one start","three","end four five six end seven one"],["","three",""]]
[["END START"],[""]]
*/
.NET
If you're using .NET, the RegEx becomes much simpler:
start(\s*(?!end)\w+\s*)*end
This is because .NET lets you retrieve every capture made by a repeated group, not just the last one.
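By contrast, engines such as Python's built-in re keep only the last capture of a repeated group, which is why the same pattern is not enough on its own there. A small sketch of that difference (the sample string is an assumption chosen for illustration):

```python
import re

sample = "start one two three end"
match = re.search(r"start(\s*(?!end)\w+\s*)*end", sample, re.I)

# Python's re overwrites group 1 on every repetition, so only the final
# word survives; .NET would expose all of them through Group.Captures.
last_word = match.group(1).strip()
print(last_word)
```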
Other approach
In practice, you're better off splitting the string into substrings and evaluating from there...
Input
start one two three end start one two three end one two end
Split the string
start(.*?)end
[0] => start one two three end
[1] => start one two three end
Match words
\b\w{3,}
Example
$string = "start one two three end start four five six end one two end";
preg_match_all('/start(.*?)end/i', $string, $matches);
foreach ($matches[1] as $match) {
    // Second pass: pick the 3+ letter words out of each start...end span
    preg_match_all('/\b\w{3,}/', $match, $out);
    var_dump($out);
}
/*
Output:
array(1) {
[0]=>
array(3) {
[0]=>
string(3) "one"
[1]=>
string(3) "two"
[2]=>
string(5) "three"
}
}
array(1) {
[0]=>
array(3) {
[0]=>
string(4) "four"
[1]=>
string(4) "five"
[2]=>
string(3) "six"
}
}
*/
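The same two-pass idea can be sketched in Python. The function name words_in_spans and its min_len parameter are assumptions introduced for this example:

```python
import re

def words_in_spans(text, min_len=3):
    """Two-pass approach: isolate each start...end span, then pick the
    words of at least min_len characters inside it."""
    word_re = re.compile(r"\b\w{%d,}" % min_len)
    # First pass: lazily capture whatever sits between start and end.
    spans = re.findall(r"start(.*?)end", text, re.I)
    # Second pass: run the simple word pattern on each captured span.
    return [word_re.findall(span) for span in spans]

text = "start one two three end start four five six end one two end"
print(words_in_spans(text))
```

Wrapping start and end in \b would stop the first pass from ending inside a word such as "send"; that is an optional tweak, not something the original answer does.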
You may use this regex with a lookahead and \G:
(?:\bSTART\b|(?!^)\G)\h+(?!END\b).*?\b(\w{3,})(?=.*?\bEND\b)
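RegEx details:
(?:\bSTART\b|(?!^)\G) : match the word START, or continue from the end of the previous match
\G : asserts the position at the end of the previous match, or at the start of the string for the first match
\h+(?!END\b).*?\b(\w{3,}) : match 1+ horizontal whitespace characters, followed by 0 or more of any character, followed by a word of 3+ characters that is captured in group 1
(?=.*?\bEND\b) : lookahead asserting that the word END is present further ahead
If quantifiers are supported inside a lookbehind, you could also use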
(?<=\bSTART\s+(?:\w+\s+)*?)\w{3,}(?=(?:\s+\w+)*?\s+END\b)
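Explanation
(?<= : positive lookbehind, assert that what is to the left is
\bSTART\s+(?:\w+\s+)*? : match START, optionally followed by repeated words and whitespace characters
) : close the lookbehind
\w{3,} : match 3 or more word characters
(?= : positive lookahead, assert that what is to the right is
(?:\s+\w+)*?\s+END\b : optionally repeat whitespace and word characters, then match END
) : close the lookahead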
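Whether this pattern is usable depends on the engine. Python's built-in re, for instance, rejects variable-length lookbehind at compile time, so a quick check like the following can decide whether to fall back to one of the other approaches. The helper name supports_pattern is hypothetical:

```python
import re

VARIABLE_LOOKBEHIND = r"(?<=\bSTART\s+(?:\w+\s+)*?)\w{3,}(?=(?:\s+\w+)*?\s+END\b)"

def supports_pattern(pattern):
    """Return True if Python's built-in re can compile the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        # re requires lookbehind to be fixed-width, so the quantified
        # lookbehind above ends up here.
        return False

print(supports_pattern(VARIABLE_LOOKBEHIND))
```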