Extract text containing match between new line characters

Question

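I'm trying to use JS to extract the paragraphs that contain key search terms from OCR'd contracts. A user might search for something like "ship ahead" to find clauses about whether a particular customer's orders can be shipped ahead of schedule.

I've been banging my head against the regex wall for a while now, and there is clearly something I'm just not grasping.

If I have text like this and am searching for the word "match":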

let text = "\n\nThis is an example of a paragraph that has the word I'm looking for The word is Match. \n\nThis paragraph does not have the word I want."

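I would want to extract all of the text between the double \n characters, and not return the second sentence in that string.

I've been trying some form of: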

let string = `[^\n\n]*match[^.]*\n\n`;

let re = new RegExp(string, "gi");
let body = text.match(re);

But that returns null. Oddly, if I remove the period from the string, it works (sort of):

[
  "This is an example of a paragraph that has the word I'm looking for The word is Match \n" +
    '\n'
]

Any help would be great.

Answer 1

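Extracting the text between two identical delimiters only when it contains some specific text is not really possible with a single regex unless you resort to context-matching tricks.

So you can simply split the text into paragraphs and keep the ones that contain your match: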

const results = text.split(/\n{2,}/).filter(x=>/\bmatch\b/i.test(x))

If you do not need a whole-word match, you can remove the word boundaries (\b).

See the JavaScript demo:

let text = "\n\nThis is an example of a paragraph that has the word I'm looking for The word is Match. \n\nThis paragraph does not have the word I want.";
console.log(text.split(/\n{2,}/).filter(x=>/\bmatch\b/i.test(x)));
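
Since the search term comes from the user (for example "ship ahead"), it probably needs to be escaped before it goes into a RegExp. Below is a minimal sketch built on the same split-and-filter idea. The names escapeRegExp and findParagraphs are hypothetical helpers, not part of any library, and the sample text is made up for illustration:

```javascript
// Hypothetical helper: escape characters that have a special meaning in a regex
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: return every paragraph (blocks separated by 2+ newlines)
// that contains the search term, ignoring case
function findParagraphs(text, term) {
  const re = new RegExp(escapeRegExp(term), "i");
  return text
    .split(/\n{2,}/)
    .map(p => p.trim())
    .filter(p => p !== "" && re.test(p));
}

// Made-up sample text, only for demonstration
const sample = "\n\nOrders may ship ahead of schedule (see 4.2). \n\nPayment is due in 30 days.";
console.log(findParagraphs(sample, "ship ahead"));
console.log(findParagraphs(sample, "(see 4.2)"));
```

Both calls should log the first paragraph only. Without the escaping step, a term such as "(see 4.2)" would be read as a capture group and would not match the literal parentheses.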

Answer 2

Here are two ways to do it. I'm not sure why you need a regular expression, though. Splitting seems much easier, doesn't it?

const text = "\n\nThis is an example of a paragraph that has the word I'm looking for The word is Match. \n\nThis paragraph does not have the word I want."

// regular expression one

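function getTextBetweenLinesUsingRegex(text) {
  // A character class matches single characters, so [^\n]+ is all that is needed
  const regex = /\n\n([^\n]+)\n\n/;
  const arr = regex.exec(text);
  // exec() returns null when nothing matches, so guard before reading the group
  if (arr && arr.length > 1) {
    return arr[1];
  }
  return null;
}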

console.log(`getTextBetweenLinesUsingRegex: ${ getTextBetweenLinesUsingRegex(text)}`);

console.log(`simple: ${text.split('\n\n')[1]}`);
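
As for why the pattern in the question returns null: a character class matches a single character, so [^\n\n] is the same as [^\n], and [^.]* cannot cross the period after "Match", so the trailing \n\n is never reached. Here is a small sketch using the question's text. The `fixed` pattern is just one possible correction, not the only one:

```javascript
const text = "\n\nThis is an example of a paragraph that has the word I'm looking for The word is Match. \n\nThis paragraph does not have the word I want.";

// The pattern from the question: [^.]* stops at the "." right after "Match",
// so the required \n\n can never match at that position
const original = new RegExp(`[^\n\n]*match[^.]*\n\n`, "gi");
console.log(text.match(original)); // null

// One possible fix: allow anything except a newline on both sides of the word
const fixed = /[^\n]*match[^\n]*/gi;
const found = text.match(fixed) || [];
console.log(found.map(s => s.trim()));
```

With the period removed, [^.]* first runs on into the next paragraph and then backtracks to the first \n\n, which is why the attempt in the question appeared to work "sort of".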

Answer 3

It's pretty easy if you use the fact that . matches every character except line breaks by default. Use the regex /.*match.*/ with a greedy .* on both sides:

const text = 'aaaa\n\nbbb match ccc\n\nddd';
const regex = /.*match.*/;
console.log(text.match(regex).toString());

Output:

bbb match ccc
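
Note that the pattern as written is case-sensitive and returns only the first match. On the question's text, where the word is capitalized as "Match", text.match(regex) would be null and .toString() would throw. Here is a sketch with the i and g flags and a null guard, assuming one paragraph per line:

```javascript
const text = "\n\nThis is an example of a paragraph that has the word I'm looking for The word is Match. \n\nThis paragraph does not have the word I want.";

// i: ignore case, g: return every matching line instead of only the first
const regex = /.*match.*/gi;

// match() returns null when nothing matches, so fall back to an empty array
const lines = (text.match(regex) || []).map(s => s.trim());
console.log(lines);
```

Because . stops at any line break, a paragraph that itself contains single line breaks would yield only the line holding the word. In that case splitting on /\n{2,}/ is the safer choice.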
