JScript regex - extract a substring between 2 words containing 1 or more occurrences of another word

Question

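I have a block of text extracted from a large PDF file, and I am only interested in one part of it. I need the part that lies between 2 occurrences of the substring test and that contains 1 or more occurrences of the specific word XX12QW. Of the 2 test substrings/words, the first one may be included in the match, as shown in the desired output below.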

Input string:

test 
abc def 
test 123 
test pqr 
XX12QW
jkl XX12QW hjas 
12asd23 test bxs

Expected output:

test pqr 
XX12QW
jkl XX12QW hjas 
12asd23

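Notes:

The delimiter substring is test.
I only need the part between 2 test substrings/words that contains 1 or more occurrences of the word XX12QW. The word XX12QW will never appear between any other pair of test substrings. That is, there will never be a case like: test abc XX12QW test isadkj XX12QW test an test
An additional test case is where XX12QW appears between test and $ (end of string/file):
- Input: test absjh123 sjnc test jhsd32 test aabb XX12QW asdj XX12QW sdfk
- Expected output: test aabb XX12QW asdj XX12QW sdfk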

I have been stuck on this for a long time and would really appreciate a second pair of eyes.

Regex: test[\s\S]*?XX12QW[\s\S]*?(?=test)

Any help is much appreciated.

Answer 1

A pure regex solution is possible, but it is better to split the string on test, take the array items that contain XX12QW, and prepend test to each of them:

var s = "test \nabc def \ntest 123 \ntest pqr \nXX12QW\njkl XX12QW hjas \n12asd23 test bxs";
var res = s.split('test').slice(1)   // Split on 'test' and drop the text before the first 'test'
       .filter(function(x) {return x.indexOf("XX12QW") !== -1;}) // Keep only the pieces containing XX12QW
       .map(function(y) {return ("test"+y).trim();});  // Prepend 'test' again and trim whitespace
console.log(res);
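
The split approach above can be wrapped in a small reusable function, which also makes it easy to check the second test case from the question (XX12QW between the last test and the end of the string). This is only a sketch: the name extractSections and its parameters are hypothetical and not part of the original answer.

```javascript
// Hypothetical helper (not from the original answer): the same
// split / filter / map idea with the delimiter and keyword as parameters.
function extractSections(text, delimiter, keyword) {
  return text
    .split(delimiter)
    .slice(1) // drop whatever precedes the first delimiter
    .filter(function (part) { return part.indexOf(keyword) !== -1; })
    .map(function (part) { return (delimiter + part).trim(); });
}

// The two inputs given in the question.
var input1 = "test \nabc def \ntest 123 \ntest pqr \nXX12QW\njkl XX12QW hjas \n12asd23 test bxs";
var input2 = "test absjh123 sjnc test jhsd32 test aabb XX12QW asdj XX12QW sdfk";

var out1 = extractSections(input1, "test", "XX12QW");
var out2 = extractSections(input2, "test", "XX12QW");

console.log(out1); // one item: "test pqr \nXX12QW\njkl XX12QW hjas \n12asd23"
console.log(out2); // one item: "test aabb XX12QW asdj XX12QW sdfk"
```

Because the last piece produced by split simply runs to the end of the string, the end-of-string case needs no special handling here.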

A single-regex solution might look like

/test(?:(?!test)[^])*?XX12QW[^]*?(?=\s*test)/

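See the regex demo.

Details

test - the literal substring test
(?:(?!test)[^])*? - any character, 0 or more times, as few as possible, provided it does not start a test character sequence
XX12QW - the literal substring XX12QW
[^]*? - any 0 or more characters, as few as possible, up to (but excluding)...
(?=\s*test) - 0 or more whitespace characters followed by the substring test.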
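
To see the breakdown above in action, the sketch below runs the regex from the question and the regex from this answer against the two inputs given in the question. The variable names are illustrative only. The third pattern, variantRe, is an assumption and not part of the original answer: it adds a $ alternative to the lookahead so that the end-of-string test case can match as well.

```javascript
var sampleA = "test \nabc def \ntest 123 \ntest pqr \nXX12QW\njkl XX12QW hjas \n12asd23 test bxs";
var sampleB = "test absjh123 sjnc test jhsd32 test aabb XX12QW asdj XX12QW sdfk";

// The asker's attempt: [\s\S]*? is free to run across intermediate "test"
// substrings, so the match starts at the very first "test".
var attemptRe = /test[\s\S]*?XX12QW[\s\S]*?(?=test)/;

// The regex from the answer, unchanged: (?!test) stops the first part
// from crossing another "test".
var answerRe = /test(?:(?!test)[^])*?XX12QW[^]*?(?=\s*test)/;

// Assumed variant (not in the original answer): the lookahead also
// accepts the end of the string.
var variantRe = /test(?:(?!test)[^])*?XX12QW[^]*?(?=\s*(?:test|$))/;

var attemptA = attemptRe.exec(sampleA);
var answerA = answerRe.exec(sampleA);
var answerB = answerRe.exec(sampleB);
var variantA = variantRe.exec(sampleA);
var variantB = variantRe.exec(sampleB);

console.log(attemptA.index);                // 0: over-matches from the first "test"
console.log(JSON.stringify(answerA[0]));    // "test pqr \nXX12QW\njkl XX12QW hjas \n12asd23"
console.log(answerB);                       // null: no "test" follows the last XX12QW
console.log(JSON.stringify(variantA[0]));   // same match as answerA[0]
console.log(JSON.stringify(variantB[0]));   // "test aabb XX12QW asdj XX12QW sdfk"
```

Note that [^] (an empty negated character class that matches any character, including newlines) is specific to JavaScript-flavoured regex engines; [\s\S] is the portable equivalent.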


javascript

regex

jscript