How to tokenize a sentence splitting on spaces, except treat quoted segments as a single token?
For example, I want to split the following sentence:
(Quick brown "fox jumps (over)") the lazy dog and (looks for food)
Expected output array:
["(Quick","brown","fox jumps (over)",")the","lazy","dog","and","(looks","for","food)"]
I tried this simple function in the TypeScript playground:
    const tokenizeSentenceText = (sentence: any = '') => {
        let wordList = [];
        // Split into single characters (an escaped character counts as one unit)
        wordList = sentence.match(/\\?.|^$/g).reduce((p: any, c: any) => {
            if (c === '"') {
                // Toggle the "inside quotes" flag
                p.quote ^= 1;
            } else if (!p.quote && c === ' ') {
                // A space outside quotes starts a new word
                p.a.push('');
            } else {
                p.a[p.a.length - 1] += c.replace(/\\(.)/, "");
            }
            return p;
        }, { a: [''] }).a;
        return wordList;
    }
and got this output:
["(Quick", "brown", "fox jumps (over))", "the", "lazy", "dog", "and", "(looks", "for", "food)"]
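As you can see, in "fox jumps (over))" the final closing parenthesis, which is written outside the double quotes, gets attached to the quoted word, giving (over)) instead of (over). That closing parenthesis after the quote should actually go to the next word, ")the".
Note: anything inside double quotes "" should be treated as a single word. There can be multiple spaces/brackets inside the double quotes.
Thanks in advance for your help.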
You can use
    const tokenizeSentenceText = (sentence) => {
        return sentence.match(/"[^"]*"|[^\s"]+/g);
    }
    // If the double quotes need removing
    const tokenizeSentenceTextNoQuotes = (sentence) => {
        return Array.from(sentence.matchAll(/"([^"]*)"|[^\s"]+/g), (x) => x[1] ?? x[0]);
    }
    const text = '(Quick brown "fox jumps (over)") the lazy dog and (looks for food)';
    console.log(tokenizeSentenceText(text))
    console.log(tokenizeSentenceTextNoQuotes(text))
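The regex matches:
- "([^"]*)" - a " character, then zero or more characters other than " (captured in Group 1), then a " character
- | - or
- [^\s"]+ - one or more characters other than whitespace and ".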
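To see which alternative produces each token, here is a minimal sketch that tags every match with the branch that matched it. The helper name describeTokens and the 'quoted' / 'bare' labels are made up for illustration and are not part of the answer's code.

```javascript
// Hypothetical helper (not from the answer): labels each token with the
// alternative of the regex that matched it.
const describeTokens = (sentence) =>
  Array.from(sentence.matchAll(/"([^"]*)"|[^\s"]+/g), (m) => ({
    token: m[0],
    // Group 1 is only defined when the quoted alternative matched.
    kind: m[1] !== undefined ? 'quoted' : 'bare',
  }));

const sample = '(Quick brown "fox jumps (over)") the lazy dog';
for (const { token, kind } of describeTokens(sample)) {
  console.log(kind, token);
}
```

For this sample only "fox jumps (over)" comes from the quoted branch. The ) right after the closing quote is a separate bare token, because a space separates it from the in the input.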
Array.from(sentence.matchAll(/"([^"]*)"|[^\s"]+/g), (x) => x[1] ?? x[0]) uses the callback (x) => x[1] ?? x[0] to return the Group 1 value when the quoted alternative matched; otherwise it returns the whole match (the text matched by [^\s"]+).
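One detail about x[1] ?? x[0] that may be worth knowing: nullish coalescing only falls back when Group 1 is undefined, so an empty quoted segment "" still yields an empty-string token. Below is a small sketch comparing it with ||. The names TOKEN_RE, withNullish and withOr are illustrative only.

```javascript
const TOKEN_RE = /"([^"]*)"|[^\s"]+/g;

// Same approach as the answer: fall back to the whole match only when
// Group 1 did not participate in the match (it is undefined).
const withNullish = (s) => Array.from(s.matchAll(TOKEN_RE), (x) => x[1] ?? x[0]);

// Variant using ||: an empty Group 1 ('') is falsy, so the whole match
// (quotes included) is returned instead.
const withOr = (s) => Array.from(s.matchAll(TOKEN_RE), (x) => x[1] || x[0]);

const input = 'a "" "b c"';
console.log(withNullish(input)); // [ 'a', '', 'b c' ]
console.log(withOr(input));      // [ 'a', '""', 'b c' ]
```

If empty quoted segments should be dropped rather than kept, filtering the resulting array afterwards is probably simpler than changing the regex.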