Regular expression for finding all substrings which are NOT inside an HTML link element
Let's assume I have the following HTML snippet:
Deep Work
<p>
<a data-href="Deep Work" href="Deep Work" class="internal-link"
target="_blank" rel="noopener">Deep Work</a>
</p>
Deep Work
<a href="blabla">Some other text</a>
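Which regular expression will match only the two "Deep Work" text fragments that are located completely outside of an a-block? That is, only the ones marked yellow in this screenshot (not the red ones):
I tried several approaches, but they always also find the last red match, which I need to avoid. I would therefore appreciate any help from the community. Thanks!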
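To make the failure mode concrete, here is a minimal sketch of the kind of attempt that runs into it. The pattern is an assumption chosen for illustration only (the patterns actually tried are not shown in the question): its negative lookahead skips occurrences inside a tag's attribute values, yet it still cannot tell link text apart from text outside the link.

```javascript
// Markup copied from the snippet above.
const markup = `Deep Work
<p>
<a data-href="Deep Work" href="Deep Work" class="internal-link"
target="_blank" rel="noopener">Deep Work</a>
</p>
Deep Work
<a href="blabla">Some other text</a>`;

// Assumed pattern: match the term unless the next angle bracket after it
// is a closing `>`, which would mean the term sits inside a tag.
const naivePattern = /Deep Work(?![^<>]*>)/g;

const matches = [...markup.matchAll(naivePattern)];

// Look at the four characters right after each match to see its context.
const followedBy = matches.map(match => {
  const end = match.index + match[0].length;
  return markup.slice(end, end + 4);
});

// Any match directly followed by `</a>` is link text, the unwanted case.
const unwanted = followedBy.filter(context => context === '</a>');

console.log(matches.length, unwanted.length);
```

The attribute values are filtered out, but the link text still matches, because nothing in the text immediately around it says that an `<a>` element is currently open.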
Update:
Unfortunately, I over-simplified the HTML code above by adding line breaks to make it readable on Stack Overflow. Here is a better use case:
<p><a data-href="Deep Work" href="Deep Work" class="internal-link" target="_blank" rel="noopener">Deep Work</a> Deep Work <a data-href="Deep Work" href="Deep Work" class="internal-link" target="_blank" rel="noopener">Deep Work</a> Deep Work </p>
Again, only the two "Deep Work" mentions outside any A-block should be found by the RegExp.
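Since the OP's example clearly shows that the OP wants to match only the node values (or text contents) of first-level text nodes, a solution based on DOMParser.parseFromString might look similar to the following example code ...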
const sampleMarkup =
`Deep Work 1
<p>
<a data-href="Deep Work" href="Deep Work" class="internal-link"
target="_blank" rel="noopener">Deep Work</a>
</p>
Deep Work 2
<a href="blabla">Some other text</a>`;
console.log(
  Array
    // make an array from ...
    .from(
      (new DOMParser)
        // ... a parsed document ...
        .parseFromString(sampleMarkup, 'text/html')
        // ... body's ...
        .querySelector('body')
        // ... child nodes ...
        .childNodes
    )
    // ... and filter just the first level text nodes in order to ...
    .filter(node => node.nodeType === 3)
    // ... retrieve each matching text node's sanitized/trimmed text content.
    .map(node => node.nodeValue.trim())
);
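Once the relevant text has been isolated through the DOM, a regular expression becomes safe to use again, because it only ever sees plain text. The following is a minimal sketch under assumptions: `findTermOffsets` and its parameters are hypothetical names that are not part of the answer, and the input is assumed to be an array of strings such as the one logged above.

```javascript
// Hypothetical helper: for each text, list the offsets at which `term` occurs.
function findTermOffsets(texts, term) {
  // Escape regex metacharacters so that the term is matched literally.
  const escaped = term.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const pattern = new RegExp(escaped, 'g');

  return texts.map(text => ({
    text,
    // `matchAll` works on a copy of the pattern, so reusing it is safe.
    offsets: [...text.matchAll(pattern)].map(match => match.index),
  }));
}

console.log(
  findTermOffsets(['Deep Work 1', 'Deep Work 2'], 'Deep Work')
);
```

Splitting the task this way keeps the structural question (is this text inside a link?) with the parser and leaves only the literal search to the pattern.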
From the above comments ...
»The last edit, which provides the one-liner markup, implicitly changes the requirements for it is not equal to the before provided formatted html code. There is a difference in matching just all first level text node values (formatted code example) and matching any text node value which is not part of an <a/> element (the one-liner markup).«
... but as already stated ...
»The task described by the OP is nothing that should be solved by regex (nor can a pure regex based approach assure 100% reliability on that matter). The OP should consider a DOMParser based approach.«
... and because the approach is reliable, the refactoring is easy to achieve ...
// `<p>
// <a data-href="Deep Work" href="Deep Work" class="internal-link" target="_blank" rel="noopener">
// Deep Work
// </a>
// Deep Work 1
// <a data-href="Deep Work" href="Deep Work" class="internal-link" target="_blank" rel="noopener">
// Deep Work
// </a>
// Deep Work 2
// </p>`
const sampleMarkup = '<p><a data-href="Deep Work" href="Deep Work" class="internal-link" target="_blank" rel="noopener">Deep Work</a>Deep Work 1<a data-href="Deep Work" href="Deep Work" class="internal-link" target="_blank" rel="noopener">Deep Work</a>Deep Work 2</p>';
function collectTextNodes(textNodeList, node) {
  const nodeType = node?.nodeType;
  if (nodeType === 1) {
    [...node.childNodes]
      .reduce(collectTextNodes, textNodeList);
  } else if (nodeType === 3) {
    textNodeList.push(node);
  }
  return textNodeList;
}
console.log(
  // ... collect any text node from within ...
  collectTextNodes(
    [],
    (new DOMParser)
      // ... a parsed document's ...
      .parseFromString(sampleMarkup, 'text/html')
      // ... body ...
      .querySelector('body')
  )
  // ... and filter any text node which is not located within an `<a/>` element ...
  .filter(textNode =>
    textNode.parentNode.closest('a') === null
  )
  // ... and retrieve each matching text node's sanitized/trimmed text content.
  .map(node =>
    node.nodeValue.trim()
  )
);
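As a variation on the approach above, the traversal can skip excluded elements entirely instead of collecting every text node and filtering afterwards. This is a sketch under assumptions and not part of the original answer: `collectTextOutside` is a hypothetical name, and it relies only on the `nodeType`, `nodeName`, `childNodes` and `nodeValue` properties, so any DOM node (for example a parsed document's `body`) should work as the root. Unlike the code above, it also drops text that is empty after trimming.

```javascript
// Hypothetical helper: collects trimmed, non-empty text that is not
// located inside an element named `excludedTag`.
function collectTextOutside(node, excludedTag = 'a', result = []) {
  if (node.nodeType === 3) {
    // Text node: keep its content unless it is whitespace only.
    const text = node.nodeValue.trim();
    if (text !== '') {
      result.push(text);
    }
  } else if (
    node.nodeType === 1 &&
    node.nodeName.toLowerCase() !== excludedTag.toLowerCase()
  ) {
    // Element node that is not excluded: descend into its children.
    for (const child of node.childNodes) {
      collectTextOutside(child, excludedTag, result);
    }
  }
  return result;
}

// Browser usage, guarded so that the block also loads in environments
// where `DOMParser` is not available.
if (typeof DOMParser !== 'undefined') {
  const markup =
    '<p><a href="Deep Work">Deep Work</a>Deep Work 1' +
    '<a href="Deep Work">Deep Work</a>Deep Work 2</p>';
  const body = new DOMParser().parseFromString(markup, 'text/html').body;

  console.log(collectTextOutside(body));
}
```

Pruning during the walk avoids one `closest` lookup per text node, at the cost of hard-wiring the exclusion rule into the traversal.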