Split a paragraph containing words in different languages
Given the input
let sentence = `browser's
emoji
rød
continuïteit a-b c+d
D-er går en
المسجد الحرام
٠١٢٣٤٥٦٧٨٩
তার মধ্যে আশ্চর্য`;
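Required output
I want every word and every space wrapped in a <span> that indicates whether it is a word or a space.
Each <span> has a type attribute with one of these values:
- w for a word
- t for a space or other non-word
Example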
<span type="w">D</span><span type="t">-</span>
<span type="w">er</span><span type="t"> </span>
<span type="w">går</span>
<span type="t"> </span><span type="w">en</span>
<span type="w">المسجد</span>
<span type="t"> </span><span type="w">الحرام</span>
<span type="t"> </span>
<span type="w">তার</span><span type="t"> </span>
<span type="w">মধ্যে</span><span type="t"> </span>
<span type="w">আশ্চর্য</span>
Ideas I investigated
Searching Stack Exchange
"Unicode string with diacritics split by chars" led me to an answer that uses the Unicode property Grapheme_Base.
Using split(/\w/) and split(/\W/) word boundaries
According to MDN, the RegEx classes \w and \W split on ASCII only:
\w and \W only matches ASCII based characters; for example, a to z, A to Z, 0 to 9, and _.
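To make that limitation concrete, here is a minimal sketch comparing \w with a Unicode property class. The sample word and the variable names are only illustrative, and the \p{...} escapes require the u flag.

```javascript
// "rød" written with an escape so the example does not depend on file encoding.
const sample = "r\u00f8d";

// \w is ASCII-only, so "ø" is treated as a non-word character.
const asciiWords = sample.match(/\w+/g);

// Letters, marks and numbers from any script (needs the u flag).
const unicodeWords = sample.match(/[\p{L}\p{M}\p{N}]+/gu);

console.log(asciiWords);   // [ 'r', 'd' ]
console.log(unicodeWords); // [ 'rød' ]
```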
Using split("")
sentence.split("") splits an emoji into its UTF-16 code units (surrogate halves) instead of keeping it whole.
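A small sketch of that behaviour, using U+1F600 as an arbitrary example emoji (the variable names are illustrative). Iterating by code point, for instance with Array.from, keeps the emoji in one piece, although it still does not group combining marks.

```javascript
// U+1F600 (grinning face) needs two UTF-16 code units.
const face = "a\u{1F600}";

const codeUnits = face.split("");     // splits between the two surrogates
const codePoints = Array.from(face);  // iterates by code point instead

console.log(codeUnits.length);  // 3
console.log(codePoints.length); // 2
```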
Unicode code point properties Grapheme_Base and Grapheme_Extend
// A base code point followed by any number of extending marks.
const matchGrapheme = /\p{Grapheme_Base}\p{Grapheme_Extend}*/gu;
let result = sentence.match(matchGrapheme);
console.log("Grapheme_Base (+Grapheme_Extend)", result);
This splits every word into graphemes but still contains all of the content.
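One detail worth checking with this kind of pattern is how many Grapheme_Extend code points may follow a base. The sketch below compares a pattern that allows at most one with a pattern that allows any number, on an illustrative string carrying two combining marks. Neither pattern is a full implementation of Unicode grapheme cluster rules; this only shows where characters can be silently dropped.

```javascript
// "e" followed by two combining marks (acute accent, dot below).
const stacked = "e\u0301\u0323";

const oneExtend = /\p{Grapheme_Base}\p{Grapheme_Extend}|\p{Grapheme_Base}/gu;
const anyExtend = /\p{Grapheme_Base}\p{Grapheme_Extend}*/gu;

// match() skips text that no alternative can start on, so the second mark is lost.
const lossy = stacked.match(oneExtend).join("");
const lossless = stacked.match(anyExtend).join("");

console.log(lossy.length);    // 2
console.log(lossless.length); // 3
```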
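Unicode properties Punctuation and White_Space
// No "|" inside the brackets: in a character class it would match a literal pipe.
const matchPunctuation = /[\p{Punctuation}\p{White_Space}]+/ug;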
let punctuationAndWhiteSpace = sentence.match(matchPunctuation);
console.log("Punctuation/White_Space", punctuationAndWhiteSpace);
This seems to pick out the non-words.
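A quick check on the "a-b c+d" fragment of the input shows what this class does and does not treat as a separator. Adding \p{Symbol} is my assumption about what might count as a non-word, not something the question specifies.

```javascript
const fragment = "a-b c+d";

// "-" is Punctuation (Pd) and " " is White_Space, but "+" is a Symbol (Sm).
const separators = fragment.match(/[\p{Punctuation}\p{White_Space}]+/gu);

// Assumption: symbols should also count as non-words.
const withSymbols = fragment.match(/[\p{Punctuation}\p{White_Space}\p{Symbol}]+/gu);

console.log(separators);  // [ '-', ' ' ]
console.log(withSymbols); // [ '-', ' ', '+' ]
```

With the Punctuation/White_Space class alone, "c+d" stays a single word.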
By merging the Grapheme_Base/Grapheme_Extend and Punctuation/White_Space results, we can walk through the grapheme split and consume the punctuation list as we go:
let sentence = `browser's
emoji
rød
continuïteit a-b c+d
D-er går en
المسجد الحرام
٠١٢٣٤٥٦٧٨٩
তার মধ্যে আশ্চর্য`;
const matchGrapheme = /\p{Grapheme_Base}\p{Grapheme_Extend}*/gu;
const matchPunctuation = /\p{Punctuation}|\p{White_Space}/ug;
// Split on LF or CRLF so that no stray "\r" is left at the end of a line.
sentence.split(/\r?\n/).forEach((line, i) => {
  console.log(`Line ${i} ${line}`);
  // match() returns null when nothing matches, so fall back to empty lists.
  const graphs = line.match(matchGrapheme) || [];
  const puncts = line.match(matchPunctuation) || [];
  console.log(graphs, puncts);
  const words = [];
  let word = "";
  const items = [];
  graphs.forEach((char) => {
    if (puncts.length > 0 && char === puncts[0]) {
      // A separator closes the current word, if there is one.
      if (word) {
        words.push(word);
        items.push({ type: "w", value: word });
      }
      word = "";
      items.push({ type: "t", value: char });
      puncts.shift();
    } else {
      word += char;
    }
  });
  if (word) {
    words.push(word);
    items.push({ type: "w", value: word });
  }
  console.log("Words", words.join(" || "));
  console.log("Items", items[0]);
  // Rejoin wrapped in '<span>'
  const l = items.map((v) => `<span type="${v.type}">${v.value}</span>`).join("");
  console.log(l);
});
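The same idea can probably be expressed in a single pass, which avoids keeping two lists in step. The sketch below is one possible variant, not the approach above: tokenPattern and toSpans are hypothetical names, and treating letters, marks and numbers as "word" characters is an assumption.

```javascript
// Group 1 captures a run of letters, marks and numbers (a word);
// the alternative matches exactly one other code point (a non-word).
const tokenPattern = /([\p{L}\p{M}\p{N}]+)|[^\p{L}\p{M}\p{N}]/gu;

function toSpans(line) {
  const items = [];
  for (const m of line.matchAll(tokenPattern)) {
    items.push({ type: m[1] !== undefined ? "w" : "t", value: m[0] });
  }
  return items
    .map((item) => `<span type="${item.type}">${item.value}</span>`)
    .join("");
}

console.log(toSpans("D-er g\u00e5r en"));
```

The values are inserted into the markup verbatim, so characters such as &, < and > would need escaping before the result is rendered as HTML.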
You can also use a combination of replace(), split(), and join().
const sentence = `browser's
emoji
rød
continuïteit a-b c+d
D-er går en
المسجد الحرام
٠١٢٣٤٥٦٧٨٩
তার মধ্যে আশ্চর্য`;
const splitP = (sentence) => {
  const oneLine = sentence.replace(/[\r\n]/g, " "); // replace every \r and \n with a space
  const splitted = oneLine.split(" ").filter(x => x); // split & filter out empty strings
  return `<span>${splitted.join("</span><span>")}</span>`; // join with span tags
};
console.log(splitP(sentence));
If you prefer a one-line solution:
const sentence = `browser's
emoji
rød
continuïteit a-b c+d
D-er går en
المسجد الحرام
٠١٢٣٤٥٦٧٨٩
তার মধ্যে আশ্চর্য`;
const result = `<span>${sentence.replace(/[\r\n]/g, " ").split(" ").filter(x => x).join("</span><span>")}</span>`;
console.log(result);
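Splitting on whitespace alone does not mark which pieces are words and which are separators. A different technique that may fit the w/t requirement is the built-in Intl.Segmenter, sketched below. It assumes a runtime that ships Intl.Segmenter (recent browsers, Node.js 16 or later). The segmentation follows the runtime's locale data, so boundaries may differ from the regex-based results; an apostrophe inside a word, for example, may be kept as part of that word. segmentToSpans is a hypothetical helper name.

```javascript
// Assumes Intl.Segmenter is available in this runtime.
const wordSegmenter = new Intl.Segmenter(undefined, { granularity: "word" });

function segmentToSpans(text) {
  let html = "";
  // Each segment reports whether it is word-like (letters, numbers) or not.
  for (const { segment, isWordLike } of wordSegmenter.segment(text)) {
    html += `<span type="${isWordLike ? "w" : "t"}">${segment}</span>`;
  }
  return html;
}

console.log(segmentToSpans("g\u00e5r en"));
```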