Regex: cut out the words nearest to a keyword

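I'm trying to build a keyword extractor in JavaScript that also keeps some context. There are a lot of steps, but most of them are fairly simple; the part I'm stuck on is keeping the unimportant words that sit next to a keyword in a paragraph. I want to cut out the two words on either side of the chosen keyword, together with the keyword itself. For example, if I have the sentence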

let sentence = 'I was walking down the street when, suddenly, the TV came on.'

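and the keyword is street, I want to extract down the street when suddenly from the sentence. Eventually I'll remove all stop words (such as the), but for now I just want to extract all of the words. I've been trying to do this with a regex, without success. Here is my code: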

let removePunc = sentence.replace(/[.,\/#!$%\^&\*;:{}=\-_`~()]/g,""); //Removes the commas and other punctuation that could interfere with my extraction
let removeSpace = removePunc.replace(/\s{2,}/g," ");  //Removes additional whitespace that's not required
const regex = new RegExp('([^\s]+\s[^\s]+\s' + keyword + '\s[^\s]+\s[^\s]+)', 'gs') //Here's where I was trying to get the two words on either side of the keyword, although it currently doesn't work
let keywordZone = regex.exec(removeSpace); //This is where the regex above should "cut out" the phrase I want

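I'm not very good with regular expressions, and I'm a bit confused about why this doesn't work, because it seems to work for the specific example on this regex simulator.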

If I try it now, it does nothing. For example, the sentence Lawmakers, flight attendants, passengers oppose TSA proposal to cut screening at airports first reported by CNN with the keyword proposal returns nothing at all.

Thanks in advance for any replies, much appreciated!

After removing the punctuation, you can split the sentence on each space and select the two array elements before and after the keyword:

let sentence = 'I was walking down the street when, suddenly, the TV came on.'
let keyword = "street";


let removePunc = sentence.replace(/[.,\/#!$%\^&\*;:{}=\-_`~()]/g,""); //Removes the commas and other punctuation that could interfere with the extraction

let wordArr = removePunc.split(" ");

let keyPos = wordArr.indexOf(keyword);

let newSentence = [wordArr[keyPos-2], wordArr[keyPos-1], wordArr[keyPos], wordArr[keyPos+1], wordArr[keyPos+2],].join(" ");

console.log(newSentence)

If you put it into a function, you can also easily test it on other strings:

function nearestFourWords(sentence, keyword) {
  // Remove the commas and other punctuation that could interfere with the extraction
  let removePunc = sentence.replace(/[.,\/#!$%\^&\*;:{}=\-_`~()]/g, "");

  // Split on runs of whitespace so repeated spaces do not produce empty entries
  let wordArr = removePunc.trim().split(/\s+/);

  let keyPos = wordArr.indexOf(keyword);
  if (keyPos === -1) return ""; // keyword not found

  // slice() stops at the array bounds, so a keyword near either end
  // returns fewer neighbours instead of undefined entries
  return wordArr.slice(Math.max(keyPos - 2, 0), keyPos + 3).join(" ");
}

const test1 = ["Lawmakers, flight attendants, passengers oppose TSA proposal to cut screening at airports first reported by CNN", "proposal"];

console.log(nearestFourWords(test1[0], test1[1]));

If you later want to remove words like the, just add those lines before the split!
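
As a rough sketch of that stop-word step: the version below filters the word array right after splitting, which is a slight variation on stripping the words from the string beforehand. The `STOP_WORDS` list and the function name `nearestFourContentWords` are made up for illustration and are not part of the answer above.

```javascript
// Hypothetical stop-word list for illustration; extend it as needed.
const STOP_WORDS = new Set(["the", "a", "an", "to", "of"]);

// Hypothetical helper: like nearestFourWords, but ignores stop words.
function nearestFourContentWords(sentence, keyword) {
  const words = sentence
    .replace(/[.,\/#!$%\^&\*;:{}=\-_`~()]/g, "") // same punctuation filter as above
    .split(/\s+/)                                 // split on any run of whitespace
    .filter(Boolean)                              // drop empty entries
    .filter((w) => !STOP_WORDS.has(w.toLowerCase())); // drop stop words

  const keyPos = words.indexOf(keyword);
  if (keyPos === -1) return ""; // keyword missing (or is itself a stop word)

  // slice() stops at the array bounds, so keywords near an edge still work
  return words.slice(Math.max(keyPos - 2, 0), keyPos + 3).join(" ");
}

console.log(nearestFourContentWords(
  "I was walking down the street when, suddenly, the TV came on.",
  "street"
)); // walking down street when suddenly
```

Because the stop words are removed before the neighbours are picked, the two words on each side are the nearest remaining words, not the nearest words in the original sentence.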

If you want or need to use a regex, here is a simple approach.

const sentence = 'I was walking down the street when, suddenly, the TV came on.'
const keyword = 'street';
const regex = `\\w+\\W+\\w+\\W+${keyword}\\W+\\w+\\W+\\w+`; // backslashes are doubled because a single \w inside a template literal collapses to w

console.log(sentence.match(regex));
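
A note on building a pattern from a string, which appears to be why the code in the question matches nothing even though the same pattern works in a regex tester: inside an ordinary string or template literal, `\s` and `\w` lose their backslash before `RegExp` ever sees them. The snippet below is a small sketch of the effect; the variable names are illustrative only.

```javascript
const word = "street";

// In a normal string literal "\s" is just "s", so the backslash is lost.
const broken = new RegExp('[^\s]+\s' + word);
console.log(broken.source); // [^s]+sstreet

// Doubling the backslashes (or using String.raw) keeps them intact.
const doubled = new RegExp('\\S+\\s+\\S+\\s+' + word + '\\s+\\S+\\s+\\S+');
const raw = new RegExp(String.raw`\S+\s+\S+\s+${word}\s+\S+\s+\S+`);

// The sentence from the question, with punctuation already removed.
const cleaned = "I was walking down the street when suddenly the TV came on";
console.log(cleaned.match(doubled)[0]);     // down the street when suddenly
console.log(raw.source === doubled.source); // true
```

Regex testers take the pattern directly, with no string-literal step in between, so a pattern can look correct there and still fail once it is pasted into a quoted string.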

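Refactoring the regex approach into a function quickly reveals a shortcoming: if the keyword is within two words of the start or end of the string, the search fails.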

const sentence = 'I was walking down the street when, suddenly, the TV came on.'

console.log({
  street: keywordSearch(sentence, 'street'),
  I: keywordSearch(sentence, 'I'),
  was: keywordSearch(sentence, 'was'),
  came: keywordSearch(sentence, 'came'),
  on: keywordSearch(sentence, 'on')
});

function keywordSearch(str, key) {
  const regex = `\\w+\\W+\\w+\\W+${key}\\W+\\w+\\W+\\w+`; // doubled backslashes survive the template literal
  
  return str.match(regex);
}

This can be mitigated by using optional groups.

const sentence = 'I was walking down the street when, suddenly, the TV came on.'

console.log({
  street: keywordSearch(sentence, 'street'),
  I: keywordSearch(sentence, 'I'),
  was: keywordSearch(sentence, 'was'),
  came: keywordSearch(sentence, 'came'),
  on: keywordSearch(sentence, 'on')
});

function keywordSearch(str, key) {
  const regex = `(?:\\w+\\W+|)(?:\\w+\\W+|)${key}(?:\\W+\\w+|)(?:\\W+\\w+|)`; // each group may match nothing, so keywords near either end still match
  
  return str.match(regex);
}
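
Two further caveats may be worth handling, shown here only as a sketch: the keyword is interpolated into the pattern unescaped, and without word boundaries it can match inside a longer word (for example `he` inside `the`). The helper names `escapeRegExp` and `keywordSearchWholeWord` are hypothetical and not part of the answer above.

```javascript
// Hypothetical helper: escape characters that are special in a regex.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical variant of keywordSearch that only matches whole words.
function keywordSearchWholeWord(str, key) {
  const k = escapeRegExp(key);
  // \b on both sides keeps the keyword from matching inside a longer word
  const regex = new RegExp(
    `(?:\\w+\\W+)?(?:\\w+\\W+)?\\b${k}\\b(?:\\W+\\w+)?(?:\\W+\\w+)?`
  );
  const match = str.match(regex);
  return match ? match[0] : null; // null when the keyword is absent
}

const text = 'I was walking down the street when, suddenly, the TV came on.';
console.log(keywordSearchWholeWord(text, 'street')); // down the street when, suddenly
console.log(keywordSearchWholeWord(text, 'on'));     // TV came on
console.log(keywordSearchWholeWord(text, 'he'));     // null
```

Note that `\b` only behaves as expected when the keyword starts and ends with a word character, so this sketch assumes plain-word keywords.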

Hope this gets you on your way.