Get innerText and split by <br>
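Below is a minimal example of some HTML from which I am trying to extract the text content. The result I want is the array ['keep1', 'keep2', 'keep3', 'keep4', 'keep5'], so I need to drop the div's child elements and then split the div's text into an array on the <br /> tags.
Normally I would use .innerText on the div, which would get all the text and drop the child elements, but as far as I know it is not appropriate here because I would lose the <br /> tags that I need to split on. Below is the best I could come up with, but it does not handle cases where a child element is not surrounded by <br /> tags. Is there a better way to do this?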
const text = document
  .querySelector("div")
  .innerHTML.split("<br>")
  .map(e => e.trim())
  .filter(e => e[0] != "<" && e != "");
console.log(text);
<div>
<br /> keep1 <br /> keep2
<span>drop</span> keep3
<br /> keep4
<br />
<h4>drop2</h4>
<br />keep5
</div>
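In terms of order of operations, it is easier to first replace the line breaks with <br> tags, using /\n/g, and then split the result. Once the only HTML element we care about (<br>) has been dealt with, the regular expression /\<(.*)\>/g strips the remaining elements.
That <br /> is "normalized" to <br> when the tags are parsed actually surprised me, but as this S.O. post says, <br /> is XHTML and the browser parses everything into HTML <br>, so that is what innerHTML returns.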
const text = document
  .querySelector("div")
  .innerHTML.replace(/\n/g, "<br>") // replace all line breaks with `<br>`
  .split("<br>")
  .map(e => e.replace(/\<(.*)\>/g, '').trim()) // strip any remaining HTML tags, then trim
  .filter(e => e); // drop the empty array elements
console.log(text);
<div>
<br /> keep1 <br /> keep2
<span>drop</span> keep3
<br /> keep4
<br />
<h4>drop2</h4>
<br />keep5
</div>
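The greedy pattern above works on this sample because each line contains at most one element; with two elements on one line it would also swallow the text between them. As a minimal sketch, assuming the markup has no nested elements of the same tag name, a backreference can remove each element together with its content before splitting. The function name splitTextOnBr is hypothetical, and it takes the innerHTML string as an argument so it can be tried outside a browser.

```javascript
// Hypothetical helper: split an innerHTML string on <br> tags and line breaks,
// dropping every other element together with its content.
function splitTextOnBr(html) {
  return html
    // Remove <tag ...>...</tag> pairs; \1 matches the same tag name that opened.
    // Assumption: no element contains a nested element with the same tag name.
    .replace(/<([a-z][a-z0-9]*)\b[^>]*>[\s\S]*?<\/\1>/gi, "\n")
    // Treat <br>, <br/>, <br /> and line breaks as separators.
    .split(/<br\s*\/?>|\n/i)
    .map(s => s.trim())
    .filter(Boolean);
}

// The sample markup as innerHTML would return it (<br /> serialized as <br>).
const html = [
  "",
  "<br> keep1 <br> keep2",
  "<span>drop</span> keep3",
  "<br> keep4",
  "<br>",
  "<h4>drop2</h4>",
  "<br>keep5",
  ""
].join("\n");

console.log(splitTextOnBr(html)); // ["keep1", "keep2", "keep3", "keep4", "keep5"]
```

In a browser the call would be splitTextOnBr(document.querySelector("div").innerHTML). Regular expressions remain a fragile way to process HTML, so this is only suitable for simple, known markup.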
One possible approach is as follows:
// we use the spread syntax inside of an Array-literal to convert the
// iterable result of document.querySelector().childNodes into an
// Array:
const text = [...
  // here we retrieve the first/only <div> element from the document
  // and return the live NodeList of all its child-nodes:
  document.querySelector('div').childNodes
// we then use Array.prototype.filter() to filter the returned collection:
].filter(
  // we use an Arrow function to test each node passed to the
  // Array.prototype.filter() method ('node' is a reference to the current
  // node of the Array of nodes);
  // node.nodeType: we first test that the node has a nodeType,
  // we then assess if the node is a textNode (the nodeType of a text-node
  // is 3),
  // finally - to prevent empty array-element-values - we check that
  // the length of the nodeValue (the text-content of the text-node) once
  // leading and trailing white-space is removed has a length greater
  // than zero:
  (node) => node.nodeType && node.nodeType === 3 && node.nodeValue.trim().length > 0
// we then use Array.prototype.map() to return a new Array based on the existing
// Array of text-nodes:
).map(
  // again we pass the array-element into the function,
  // and here we trim the leading/trailing white-space of the node's value,
  // by passing the string to String.prototype.trim():
  (node) => node.nodeValue.trim()
);
console.log(text); // ["keep1","keep2","keep3","keep4","keep5"]
<div>
<br /> keep1 <br /> keep2
<span>drop</span> keep3
<br /> keep4
<br />
<h4>drop2</h4>
<br />keep5
</div>
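The same idea can be wrapped in a reusable function. This is a minimal sketch, not part of the answer above: the name directText and the stand-in node objects are hypothetical, and plain objects with nodeType and nodeValue are used in place of real DOM nodes so the example also runs outside a browser.

```javascript
// nodeType value of a text node (Node.TEXT_NODE in browsers).
const TEXT_NODE = 3;

// Hypothetical helper: return the trimmed, non-empty text of every
// text node in `nodes` (any iterable of nodes, e.g. element.childNodes).
function directText(nodes) {
  return Array.from(nodes)
    .filter(node => node.nodeType === TEXT_NODE)
    .map(node => node.nodeValue.trim())
    .filter(text => text.length > 0);
}

// Stand-ins for the child nodes of the sample <div>:
// nodeType 1 is an element (<br>, <span>, <h4>), nodeType 3 is a text node.
const sampleNodes = [
  { nodeType: 3, nodeValue: "\n" },
  { nodeType: 1, nodeValue: null },        // <br>
  { nodeType: 3, nodeValue: " keep1 " },
  { nodeType: 1, nodeValue: null },        // <br>
  { nodeType: 3, nodeValue: " keep2\n" },
  { nodeType: 1, nodeValue: null },        // <span>drop</span>
  { nodeType: 3, nodeValue: " keep3\n" },
  { nodeType: 1, nodeValue: null },        // <br>
  { nodeType: 3, nodeValue: " keep4\n" },
  { nodeType: 1, nodeValue: null },        // <br>
  { nodeType: 3, nodeValue: "\n" },
  { nodeType: 1, nodeValue: null },        // <h4>drop2</h4>
  { nodeType: 3, nodeValue: "\n" },
  { nodeType: 1, nodeValue: null },        // <br>
  { nodeType: 3, nodeValue: "keep5\n" }
];

console.log(directText(sampleNodes)); // ["keep1", "keep2", "keep3", "keep4", "keep5"]
```

In a browser the call would be directText(document.querySelector("div").childNodes). Because only direct text nodes are read, the text inside child elements is dropped without any string parsing; text that sits in a single text node stays one entry even if it contains line breaks.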