Regex Cut Words nearest to Keyword

Question

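I am trying to build a keyword extractor in JavaScript that also captures some context. There are several steps, and most of them are simple; the part I am stuck on is including the unimportant words that sit next to the keyword in the paragraph. I want to cut out the two words on either side of the selected keyword, along with the keyword itself. For example, if I had the sentence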

let sentence = 'I was walking down the street when, suddenly, the TV came on.'

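and the keyword was street, I would want to extract down the street when suddenly from the sentence. Eventually I will remove all the stop words (such as the), but for now I just want to extract all the words. I have been trying to do this with a regex, without success. Here is my code: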

let removePunc = sentence.replace(/[.,\/#!$%\^&\*;:{}=\-_`~()]/g,""); //Removes the commas and other punctuation that could interfere with my extraction
let removeSpace = removePunc.replace(/\s{2,}/g," ");  //Removes additional whitespace that's not required
const regex = new RegExp('([^\s]+\s[^\s]+\s' + keyword + '\s[^\s]+\s[^\s]+)', 'gs') //Here's where I was trying to get the two words on either side of the keyword, although it currently doesn't work
let keywordZone = regex.exec(removeSpace); //This is where the regex above should "cut out" the phrase I want

I'm not very good with regex, and I'm a bit confused about why it isn't working, since it seems to work on the specific example in this regex simulator.

If I try it now, it does nothing. For example, the sentence Lawmakers, flight attendants, passengers oppose TSA proposal to cut screening at airports first reported by CNN with the keyword proposal produces no result at all.

Thanks in advance for any replies, much appreciated!

Answer 1

After removing the punctuation, you can split the sentence on each space and select the two array elements before and after the keyword:

let sentence = 'I was walking down the street when, suddenly, the TV came on.'
let keyword = "street";


// Remove commas and other punctuation that could interfere with the extraction
let removePunc = sentence.replace(/[.,\/#!$%\^&\*;:{}=\-_`~()]/g,"");

// Split into words and locate the keyword (indexOf returns -1 if it is absent)
let wordArr = removePunc.split(" ");

let keyPos = wordArr.indexOf(keyword);

// Take the two words on either side of the keyword, plus the keyword itself
let newSentence = [wordArr[keyPos-2], wordArr[keyPos-1], wordArr[keyPos], wordArr[keyPos+1], wordArr[keyPos+2]].join(" ");

console.log(newSentence)

If you put this into a function, you can also easily test it on other strings:

function nearestFourWords(sentence, keyword) {
  // Remove commas and other punctuation that could interfere with the extraction
  let removePunc = sentence.replace(/[.,\/#!$%\^&\*;:{}=\-_`~()]/g, "");

  let wordArr = removePunc.split(" ");

  let keyPos = wordArr.indexOf(keyword);

  // Two words before the keyword, the keyword, and two words after it
  let newSentence = [wordArr[keyPos - 2], wordArr[keyPos - 1], wordArr[keyPos], wordArr[keyPos + 1], wordArr[keyPos + 2]].join(" ");

  return newSentence;
}

const test1 = ["Lawmakers, flight attendants, passengers oppose TSA proposal to cut screening at airports first reported by CNN", "proposal"];

console.log(nearestFourWords(test1[0], test1[1]));

If you later want to remove words like the, just add those lines before the split!
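As a rough sketch of how the split-and-index idea could be made a little more forgiving, the version below clamps the window at the start of the array, returns null when the keyword is missing, and optionally filters stop words before the split result is searched. The names contextWindow, radius, and STOP_WORDS are made up for this illustration and are not part of the answer above; the stop-word list is only a placeholder. It also strips punctuation with /[^\w\s]/g, which is ASCII-oriented and removes apostrophes and hyphens too.

```javascript
// Hypothetical stop-word list, for illustration only.
const STOP_WORDS = new Set(["the", "a", "an", "to", "of"]);

// Returns up to `radius` words on each side of `keyword`,
// or null when the keyword does not occur as a whole word.
function contextWindow(sentence, keyword, radius = 2, stopWords = new Set()) {
  const words = sentence
    .replace(/[^\w\s]/g, "") // drop punctuation (ASCII word characters only)
    .split(/\s+/)            // split on any run of whitespace
    // discard empty strings and stop words, but always keep the keyword itself
    .filter((w) => w !== "" && (w === keyword || !stopWords.has(w.toLowerCase())));

  const keyPos = words.indexOf(keyword);
  if (keyPos === -1) return null;

  // Clamp the start so a keyword near the beginning does not produce a negative index;
  // slice() already stops at the end of the array.
  const start = Math.max(keyPos - radius, 0);
  return words.slice(start, keyPos + radius + 1).join(" ");
}

const demoSentence = "I was walking down the street when, suddenly, the TV came on.";
console.log(contextWindow(demoSentence, "street"));                // down the street when suddenly
console.log(contextWindow(demoSentence, "street", 2, STOP_WORDS)); // walking down street when suddenly
console.log(contextWindow(demoSentence, "on"));                    // TV came on
```

Filtering before indexOf means the two-word window is counted in non-stop words; filter after slicing instead if the window should be counted in original words.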

Answer 2

If you want or need to use a regex, here is a simple approach.

const sentence = 'I was walking down the street when, suddenly, the TV came on.'
const keyword = 'street';
// Backslashes are doubled so they survive the template literal and reach the regex engine
const regex = `\\w+\\W+\\w+\\W+${keyword}\\W+\\w+\\W+\\w+`;

console.log(sentence.match(regex));
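One detail that is easy to miss when a pattern is built from a string, as with new RegExp(...) in the question: the string is parsed before the regex engine ever sees it, and in an ordinary string or template literal a sequence like \s or \w is not a recognised escape, so the backslash is silently dropped. The short demonstration below shows the difference; the variable names are only for illustration.

```javascript
// In a normal string literal, "\s" collapses to a plain "s".
const lost = new RegExp('\s+');           // equivalent to /s+/
// Doubling the backslash keeps it in the resulting pattern.
const kept = new RegExp('\\s+');          // equivalent to /\s+/
// String.raw leaves backslashes in a template literal untouched.
const raw = new RegExp(String.raw`\s+`);  // equivalent to /\s+/

console.log(lost.source); // s+
console.log(kept.source); // \s+
console.log(raw.source);  // \s+

console.log(lost.test('a b')); // false: there is no letter "s" in the input
console.log(kept.test('a b')); // true: the space matches \s
```

This would explain why a pattern can behave correctly in an online regex tester, where it is typed as a bare regex, and still fail once it is pasted into a quoted string.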

Refactoring this into a function quickly reveals a shortcoming: the search fails if the keyword is within two words of the start or end of the string.

const sentence = 'I was walking down the street when, suddenly, the TV came on.'

console.log({
  street: keywordSearch(sentence, 'street'),
  I: keywordSearch(sentence, 'I'),
  was: keywordSearch(sentence, 'was'),
  came: keywordSearch(sentence, 'came'),
  on: keywordSearch(sentence, 'on')
});

function keywordSearch(str, key) {
  // Doubled backslashes keep \w and \W intact inside the template literal
  const regex = `\\w+\\W+\\w+\\W+${key}\\W+\\w+\\W+\\w+`;
  
  return str.match(regex);
}

This can be mitigated by making the surrounding groups optional.

const sentence = 'I was walking down the street when, suddenly, the TV came on.'

console.log({
  street: keywordSearch(sentence, 'street'),
  I: keywordSearch(sentence, 'I'),
  was: keywordSearch(sentence, 'was'),
  came: keywordSearch(sentence, 'came'),
  on: keywordSearch(sentence, 'on')
});

function keywordSearch(str, key) {
  // Each neighbouring word is optional, so keywords near either end still match
  const regex = `(?:\\w+\\W+|)(?:\\w+\\W+|)${key}(?:\\W+\\w+|)(?:\\W+\\w+|)`;
  
  return str.match(regex);
}
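Two further refinements may be worth considering, sketched here under some assumptions. First, a keyword interpolated into a pattern is treated as regex syntax, so characters such as . or ( would need escaping. Second, without word boundaries a keyword can match inside a longer word. The helper names escapeRegExp and keywordSearchSafe are hypothetical, and the \b anchors assume the keyword starts and ends with a word character.

```javascript
// Escape regex metacharacters so the keyword is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical variant of keywordSearch: up to two words on each side,
// with \b so the keyword only matches as a whole word.
// Returns the matched text, or null when there is no match.
function keywordSearchSafe(str, key) {
  const k = escapeRegExp(key);
  // String.raw keeps the backslashes, so no doubling is needed here.
  const regex = new RegExp(String.raw`(?:\w+\W+){0,2}\b${k}\b(?:\W+\w+){0,2}`);
  const match = str.match(regex);
  return match ? match[0] : null;
}

const text = "I was walking down the street when, suddenly, the TV came on.";
console.log(keywordSearchSafe(text, "street")); // down the street when, suddenly
console.log(keywordSearchSafe(text, "I"));      // I was walking
console.log(keywordSearchSafe(text, "on"));     // TV came on
console.log(keywordSearchSafe(text, "own"));    // null ("own" only occurs inside "down")
```

Unlike the function above it, this sketch returns the matched string rather than the full match array, and the {0,2} quantifier expresses the same idea as the repeated optional groups.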

Hope this gets you on your way.
