How can I tokenize a sentence by splitting on spaces, while treating quoted segments as a single token?

For example, I want to split the following sentence:

(Quick brown "fox jumps (over)") the lazy dog and (looks for food)

Expected output array:

["(Quick","brown","fox jumps (over)",")the","lazy","dog","and","(looks","for","food)"]

I tried this simple function in the TypeScript playground:

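const tokenizeSentenceText = (sentence: any = '') => {
  let wordList = [];

  // Walk the sentence one character at a time (an escaped pair such as \" counts as one unit).
  wordList = sentence.match(/\\?.|^$/g).reduce((p: any, c: any) => {
    if (c === '"') {
      p.quote ^= 1; // toggle the "inside quotes" flag
    } else if (!p.quote && c === ' ') {
      p.a.push(''); // a space outside quotes starts a new token
    } else {
      p.a[p.a.length - 1] += c.replace(/\\(.)/, "$1"); // append the character, dropping an escaping backslash
    }
    return p;
  }, { a: [''] }).a;

  return wordList;
};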

I get this output:

["(Quick", "brown", "fox jumps (over))", "the", "lazy", "dog", "and", "(looks", "for", "food)"]

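As you can see, in "fox jumps (over))" the closing parenthesis written after the closing double quote is attached to the quoted token, giving (over)) instead of (over). That last closing parenthesis, which comes after the quote, should instead go to the next word: ")the".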

Note: anything inside double quotes "" should be treated as a single word. There can be multiple spaces/brackets inside the double quotes.

Thanks in advance for your help.

You can use

const tokenizeSentenceText = (sentence) => {
  return sentence.match(/"[^"]*"|[^\s"]+/g);
}
// If the double quotes need removing 
const tokenizeSentenceTextNoQuotes = (sentence) => {
  return Array.from(sentence.matchAll(/"([^"]*)"|[^\s"]+/g), (x) => x[1] ?? x[0]);
}

const text = '(Quick brown "fox     jumps (over)") the lazy dog and (looks for food)';
console.log(tokenizeSentenceText(text))
console.log(tokenizeSentenceTextNoQuotes(text))

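The regex matches

  • "([^"]*)" - a " character, then any zero or more characters other than " (captured in Group 1), then a " character
  • | - or
  • [^\s"]+ - one or more characters other than whitespace and ".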
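To make the two alternatives concrete, here is a minimal sketch that records which branch produced each token. The names `Token` and `classifyTokens` are hypothetical and introduced only for illustration; the pattern is the one from the answer, and the loop uses `RegExp.prototype.exec` instead of `matchAll` so that it does not depend on the compile target.

```typescript
// Hypothetical helper type: one token plus a flag saying whether it came from the quoted branch.
type Token = { text: string; quoted: boolean };

const classifyTokens = (sentence: string): Token[] => {
  // Same pattern as in the answer: a quoted segment (Group 1) or a run of non-space, non-quote characters.
  const pattern = /"([^"]*)"|[^\s"]+/g;
  const tokens: Token[] = [];
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(sentence)) !== null) {
    // Group 1 is undefined when the second alternative matched.
    const quoted = m[1] !== undefined;
    tokens.push({ text: quoted ? m[1] : m[0], quoted });
  }
  return tokens;
};

const sample = classifyTokens('(Quick brown "fox jumps (over)") the');
console.log(sample.map((t) => t.text));
// ["(Quick", "brown", "fox jumps (over)", ")", "the"]
```

Note that with this pattern the `)` that follows the closing quote becomes a token of its own, because the space after it ends the unquoted run; it is not merged into the next word.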

In Array.from(sentence.matchAll(/"([^"]*)"|[^\s"]+/g), (x) => x[1] ?? x[0]), the callback (x) => x[1] ?? x[0] returns the Group 1 value if that alternative matched, and otherwise returns the whole match (the text matched by [^\s"]+).
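The choice of `??` rather than `||` in that callback matters for an empty quoted segment, where Group 1 participates in the match but captures an empty string. The sketch below illustrates the difference; `collect` is a hypothetical helper, and the input `say "" now` is an assumed test case, not one from the question.

```typescript
// Hypothetical helper: run the answer's pattern and map each match through `pick`.
const collect = (sentence: string, pick: (m: RegExpExecArray) => string): string[] => {
  const pattern = /"([^"]*)"|[^\s"]+/g;
  const out: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(sentence)) !== null) {
    out.push(pick(m));
  }
  return out;
};

// `??` falls back only when Group 1 is undefined, so the empty quoted token is kept as "".
const withNullish = collect('say "" now', (m) => m[1] ?? m[0]);
// `||` also falls back on the empty string, so the quotes come back as part of the token.
const withOr = collect('say "" now', (m) => m[1] || m[0]);

console.log(withNullish); // ["say", "", "now"]
console.log(withOr);      // ["say", "\"\"", "now"]
```

If empty tokens are not wanted at all, they could be filtered out afterwards; whether that is appropriate depends on the use case.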