Regex Cut Words nearest to Keyword
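I'm trying to build a keyword extractor in JavaScript that also captures some context around each keyword. There are a lot of steps, but most of them are pretty simple; what remains is picking up the surrounding, unimportant words next to a keyword in the paragraph. I want to cut out the two words on either side of the chosen keyword, together with the keyword itself. For example, if I have the sentence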
let sentence = 'I was walking down the street when, suddenly, the TV came on.'
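and the keyword is "street", I want to extract "down the street when suddenly" from the sentence. Eventually I will remove all stop words (such as "the"), but for now I just want to extract every word. I've been trying to do this with a regular expression, without success. Here is my code: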
let removePunc = sentence.replace(/[.,\/#!$%\^&\*;:{}=\-_`~()]/g,""); //Removes the commas and other punctuation that could interfere with my extraction
let removeSpace = removePunc.replace(/\s{2,}/g," "); //Removes additional whitespace that's not required
const regex = new RegExp('([^\s]+\s[^\s]+\s' + keyword + '\s[^\s]+\s[^\s]+)', 'gs') //Here's where I was trying to get the two words on either side of the keyword, although it currently doesn't work
let keywordZone = regex.exec(removeSpace); //This is where the regex above should "cut out" the phrase I want
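I'm not very good with regular expressions, and I'm a bit confused about why this doesn't work, since it seems to work for the specific example on this regex simulator.
When I try it now, it does nothing. For example, with the sentence "Lawmakers, flight attendants, passengers oppose TSA proposal to cut screening at airports first reported by CNN" and the keyword "proposal", it returns nothing at all.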
Thanks in advance for any replies; I really appreciate it!
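After removing the punctuation, you can split the sentence on each space and select the two elements before and after the keyword in the resulting array: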
let sentence = 'I was walking down the street when, suddenly, the TV came on.';
let keyword = "street";
// Remove commas and other punctuation that could interfere with the extraction
let removePunc = sentence.replace(/[.,\/#!$%\^&\*;:{}=\-_`~()]/g, "");
let wordArr = removePunc.split(" ");
let keyPos = wordArr.indexOf(keyword);
// Take the two words before the keyword, the keyword itself, and the two words after it
let newSentence = [wordArr[keyPos - 2], wordArr[keyPos - 1], wordArr[keyPos], wordArr[keyPos + 1], wordArr[keyPos + 2]].join(" ");
console.log(newSentence); // down the street when suddenly
If you wrap this in a function, you can also easily test it on other strings:
function nearestFourWords(sentence, keyword) {
  // Remove commas and other punctuation that could interfere with the extraction
  let removePunc = sentence.replace(/[.,\/#!$%\^&\*;:{}=\-_`~()]/g, "");
  let wordArr = removePunc.split(" ");
  let keyPos = wordArr.indexOf(keyword);
  // Two words before the keyword, the keyword itself, and two words after it
  let newSentence = [wordArr[keyPos - 2], wordArr[keyPos - 1], wordArr[keyPos], wordArr[keyPos + 1], wordArr[keyPos + 2]].join(" ");
  return newSentence;
}
const test1 = ["Lawmakers, flight attendants, passengers oppose TSA proposal to cut screening at airports first reported by CNN", "proposal"];
console.log(nearestFourWords(test1[0], test1[1])); // oppose TSA proposal to cut
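One thing to watch for: when the keyword sits within two words of either end of the sentence, or is missing altogether, indexing with keyPos - 2 and its neighbours yields undefined entries. A minimal sketch of a variant that clamps the window with slice() follows. The name wordsAround and the radius parameter are assumptions made for illustration, not part of the answer above.

```javascript
// Hypothetical helper: returns the keyword plus up to `radius` words on each side.
function wordsAround(sentence, keyword, radius = 2) {
  // Strip punctuation, then split on runs of whitespace and drop empty strings.
  const words = sentence
    .replace(/[.,\/#!$%\^&\*;:{}=\-_`~()]/g, "")
    .split(/\s+/)
    .filter(Boolean);
  const pos = words.indexOf(keyword);
  if (pos === -1) return null; // keyword not present
  // slice() stops at the end of the array; Math.max guards the start.
  return words.slice(Math.max(pos - radius, 0), pos + radius + 1).join(" ");
}

const s = "I was walking down the street when, suddenly, the TV came on.";
console.log(wordsAround(s, "street")); // down the street when suddenly
console.log(wordsAround(s, "was"));    // I was walking down
console.log(wordsAround(s, "on"));     // TV came on
```

Returning null for a missing keyword is a design choice; an empty string would work equally well if the caller prefers it.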
If you later want to remove words like "the", just add those lines before the split!
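As a rough sketch of that idea, it is often easier to filter the word array right after splitting than to edit the raw string. The STOP_WORDS list and the function name nearestWordsNoStops below are placeholders chosen for illustration; a real stop-word list would be much longer.

```javascript
// Illustrative stop-word list (an assumption; substitute your own).
const STOP_WORDS = new Set(["the", "a", "an", "of", "to"]);

// Hypothetical helper: like the function above, but skips stop words.
function nearestWordsNoStops(sentence, keyword) {
  const words = sentence
    .replace(/[.,\/#!$%\^&\*;:{}=\-_`~()]/g, "")
    .split(/\s+/)
    // Drop empty strings and stop words, but never the keyword itself.
    .filter((w) => w && (w === keyword || !STOP_WORDS.has(w.toLowerCase())));
  const pos = words.indexOf(keyword);
  if (pos === -1) return null; // keyword not present
  return words.slice(Math.max(pos - 2, 0), pos + 3).join(" ");
}

console.log(
  nearestWordsNoStops("I was walking down the street when, suddenly, the TV came on.", "street")
); // walking down street when suddenly
```

Because the stop words are removed before the window is taken, the two neighbours on each side are the nearest remaining words rather than the literal adjacent ones.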
If you want/need to use a regular expression, here is a simple approach.
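const sentence = 'I was walking down the street when, suddenly, the TV came on.';
const keyword = 'street';
// String.raw keeps the backslashes, so \w and \W reach the RegExp constructor intact
const regex = new RegExp(String.raw`\w+\W+\w+\W+${keyword}\W+\w+\W+\w+`);
console.log(sentence.match(regex));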
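A note on escaping, since it is the likely reason a pattern works in a regex tester but not when built from a string: in an ordinary string or template literal the string parser consumes the backslash first, so '\s' reaches the RegExp constructor as a plain s. The sketch below prints the resulting pattern source to make this visible and shows two common ways around it. The variable names are for illustration only.

```javascript
const kw = "street";

// '\s' in a normal string literal is just 's', so the pattern loses its escapes.
const broken = new RegExp('[^\s]+\s' + kw);
console.log(broken.source); // [^s]+sstreet

// Doubling the backslashes keeps them in the string.
const doubled = new RegExp('[^\\s]+\\s' + kw);
console.log(doubled.source); // [^\s]+\sstreet

// String.raw leaves backslashes untouched inside a template literal.
const raw = new RegExp(String.raw`[^\s]+\s${kw}`);
console.log(raw.source); // [^\s]+\sstreet

console.log(broken.test("down the street")); // false
console.log(raw.test("down the street"));    // true
```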
Refactoring this into a function quickly reveals a drawback: if the keyword is within two words of the start or end of the string, the search fails.
const sentence = 'I was walking down the street when, suddenly, the TV came on.'
console.log({
street: keywordSearch(sentence, 'street'),
I: keywordSearch(sentence, 'I'),
was: keywordSearch(sentence, 'was'),
came: keywordSearch(sentence, 'came'),
on: keywordSearch(sentence, 'on')
});
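function keywordSearch(str, key) {
  // Requires two full words on each side, so keywords near either end return null
  const regex = new RegExp(String.raw`\w+\W+\w+\W+${key}\W+\w+\W+\w+`);
  return str.match(regex);
}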
This can be mitigated by using optional groups.
const sentence = 'I was walking down the street when, suddenly, the TV came on.'
console.log({
street: keywordSearch(sentence, 'street'),
I: keywordSearch(sentence, 'I'),
was: keywordSearch(sentence, 'was'),
came: keywordSearch(sentence, 'came'),
on: keywordSearch(sentence, 'on')
});
function keywordSearch(str, key) {
  // Each neighbouring word is wrapped in a group that may also match nothing
  const regex = new RegExp(String.raw`(?:\w+\W+|)(?:\w+\W+|)${key}(?:\W+\w+|)(?:\W+\w+|)`);
  return str.match(regex);
}
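Two further caveats, offered as a sketch rather than as part of the answer above: a keyword containing regex metacharacters (a dot or a parenthesis, say) would be read as pattern syntax, and a short keyword can match inside a longer word. The helper names escapeRegExp and keywordSearchSafe are hypothetical. The \b anchors assume the keyword starts and ends with a word character.

```javascript
// Escape characters that have special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical variant of keywordSearch that returns the matched text or null.
function keywordSearchSafe(str, key) {
  const k = escapeRegExp(key);
  // \b keeps the keyword from matching inside a longer word
  // (assumes the keyword begins and ends with a word character).
  const regex = new RegExp(
    String.raw`(?:\w+\W+)?(?:\w+\W+)?\b${k}\b(?:\W+\w+)?(?:\W+\w+)?`
  );
  const m = str.match(regex);
  return m ? m[0] : null;
}

const text = "I was walking down the street when, suddenly, the TV came on.";
console.log(keywordSearchSafe(text, "street")); // down the street when, suddenly
console.log(keywordSearchSafe(text, "I"));      // I was walking
console.log(keywordSearchSafe(text, "on"));     // TV came on
console.log(keywordSearchSafe(text, "wal"));    // null
```

Because the pattern runs on the original sentence, punctuation between the words is kept in the result; strip it afterwards if only the words are wanted.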
Hope this gets you on your way.