Using JavaScript, how do I transform an HTML string into an array of HTML tags and text content?
I have an HTML string, for example:
<p>
<strong><em>Lorem Ipsum </em></strong>is simply dummy text of the printing <em>and</em> typesetting industry.
</p>
I want to convert it into a JavaScript array like this:
['<p>', '<strong>', '<em>', 'Lorem Ipsum ', '</em>', '</strong>', 'is simply dummy text of the printing ', '<em>', 'and', '</em>', 'typesetting industry.', '</p>']
That is, it takes the HTML string and breaks it into an array of tags and text content.
Based on a related question, I tried using DOMParser():
const str = `<p><strong><em>Lorem Ipsum </em></strong>is simply dummy text of the printing <em>and</em> typesetting industry.</p>`;
const doc = new DOMParser().parseFromString(str, 'text/html');
const arr = [...doc.body.childNodes]
  .map(child => child.outerHTML || child.textContent);
However, this simply returns:
['<p><strong><em>Lorem Ipsum </em></strong>is simply dummy text of the printing <em>and</em> typesetting industry.</p>']
I have also searched for various regex-based solutions, but could not find one that breaks the string up exactly the way I need.
Any suggestions?
Thanks
I would create a recursive function that traverses a given node and returns an array of the text representations of its children:
const str = `<p><strong><em>Lorem Ipsum </em></strong>is simply dummy text of the printing <em>and</em> typesetting industry.</p>`;
const doc = new DOMParser().parseFromString(str, 'text/html');
// Recursively flatten a node's children into opening tags, text, and closing tags.
const parseNode = node => {
  const output = [];
  for (const child of node.childNodes) {
    if (child.nodeType === Node.TEXT_NODE) {
      output.push(child.textContent);
    } else if (child.nodeType === Node.ELEMENT_NODE) {
      // tagName is uppercase for HTML elements, so lowercase it to get '<p>' rather than '<P>'.
      const tag = child.tagName.toLowerCase();
      output.push(`<${tag}>`);
      output.push(...parseNode(child));
      output.push(`</${tag}>`);
    }
  }
  return output;
};
console.log(parseNode(doc.body));
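Since the question asks about regex: where no DOM is available (for example outside a browser), a regular expression can approximate the same split for simple, well-formed markup. This is only a sketch under those assumptions, not a parser. The helper name `splitHtml` is hypothetical, and the pattern will break on comments, `<script>` contents, or attribute values that contain `>`.

```javascript
// Hypothetical helper: split on anything that looks like a tag.
// The capturing group makes String.prototype.split keep the tags in the result.
const splitHtml = html =>
  html.split(/(<[^>]+>)/).filter(part => part !== '');

const sample = '<p><em>Lorem </em>ipsum</p>';
console.log(splitHtml(sample));
```

Unlike the DOM-based traversal, this does not normalize or repair the markup; it returns the tags exactly as written in the input.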
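If you also need to preserve attributes, you can take the element's outerHTML and capture everything between the tag name and the first closing bracket: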
const str = `<p style="color:green"><strong><em>Lorem Ipsum </em></strong>is simply dummy text of the printing <em>and</em> typesetting industry.</p>`;
const doc = new DOMParser().parseFromString(str, 'text/html');
const parseNode = node => {
  const output = [];
  for (const child of node.childNodes) {
    if (child.nodeType === Node.TEXT_NODE) {
      output.push(child.textContent);
    } else if (child.nodeType === Node.ELEMENT_NODE) {
      // tagName is uppercase for HTML elements, so lowercase it.
      const tag = child.tagName.toLowerCase();
      // Capture whatever follows the tag name, up to the first '>'.
      const attribs = child.outerHTML.match(/<\s*[^>\s]+([^>]*)/)[1];
      output.push(`<${tag}${attribs}>`);
      output.push(...parseNode(child));
      output.push(`</${tag}>`);
    }
  }
  return output;
};
console.log(parseNode(doc.body));
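One caveat with the regex on outerHTML: it stops at the first `>`, so an attribute value that itself contains `>` could be cut short. As a sketch of an alternative, the opening tag can be rebuilt from the element's `attributes` collection instead. The helper name `openingTag` is hypothetical, and the escaping shown (only `&` and `"`) is a simplifying assumption.

```javascript
// Hypothetical helper: rebuild an element's opening tag from its attributes
// instead of slicing it out of outerHTML.
const openingTag = el => {
  const attrs = Array.from(el.attributes)
    .map(({ name, value }) => {
      // Escape the characters that would end or corrupt a double-quoted value.
      const escaped = value.replace(/&/g, '&amp;').replace(/"/g, '&quot;');
      return ` ${name}="${escaped}"`;
    })
    .join('');
  return `<${el.tagName.toLowerCase()}${attrs}>`;
};
```

Inside the loop, `output.push(openingTag(child))` would then take the place of the regex match and the template string that follows it.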
If self-closing tags should not be expanded, check whether the element's outerHTML contains </:
const str = `<p style="color:green"><input readonly value="x"/><strong><em>Lorem Ipsum </em></strong>is simply dummy text of the printing <em>and</em> typesetting industry.</p>`;
const doc = new DOMParser().parseFromString(str, 'text/html');
const parseNode = node => {
  const output = [];
  for (const child of node.childNodes) {
    if (child.nodeType === Node.TEXT_NODE) {
      output.push(child.textContent);
    } else if (child.nodeType === Node.ELEMENT_NODE) {
      // tagName is uppercase for HTML elements, so lowercase it.
      const tag = child.tagName.toLowerCase();
      const attribs = child.outerHTML.match(/<\s*[^>\s]+([^>]*)/)[1];
      output.push(`<${tag}${attribs}>`);
      if (child.outerHTML.includes('</')) {
        // Not self-closing: emit the children and the closing tag.
        output.push(...parseNode(child));
        output.push(`</${tag}>`);
      }
    }
  }
  return output;
};
console.log(parseNode(doc.body));
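The `includes('</')` check relies on the serialized markup, so it could misfire if an attribute value happens to contain `</`. A sketch of a stricter test is to compare the tag name against the void elements listed in the HTML Standard. The names `VOID_ELEMENTS` and `isVoid` are hypothetical.

```javascript
// Void elements as listed in the HTML Standard; they never take a closing tag.
const VOID_ELEMENTS = new Set([
  'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
  'input', 'link', 'meta', 'source', 'track', 'wbr',
]);

// Hypothetical helper: true when the element must not get a closing tag.
const isVoid = el => VOID_ELEMENTS.has(el.tagName.toLowerCase());
```

In the loop, `if (!isVoid(child))` would replace the `outerHTML.includes('</')` condition.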