Regular expression for finding all substrings which are NOT inside an HTML link element

Question

Suppose I have the following HTML snippet:

Deep Work 
<p>
<a data-href="Deep Work" href="Deep Work" class="internal-link" 
target="_blank" rel="noopener">Deep Work</a>
</p>
Deep Work
<a href="blabla">Some other text</a>

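Which regular expression would match only the two "Deep Work" text fragments that lie completely outside the a blocks? In my screenshot, these are the matches marked yellow, not the ones marked red.

I have tried several approaches, but each of them also finds the last red match, which I need to avoid. I would appreciate any help from the community. Thanks!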

Update: Unfortunately, I oversimplified the HTML above by adding line breaks to make it readable on Stack Overflow. Here is a better use case:

<p><a data-href="Deep Work" href="Deep Work" class="internal-link" target="_blank" rel="noopener">Deep Work</a> Deep Work <a data-href="Deep Work" href="Deep Work" class="internal-link" target="_blank" rel="noopener">Deep Work</a> Deep Work </p>

Again, only the two "Deep Work" mentions outside any A-block should be found by the RegExp.

Answer 1

»Again only the two "Deep Work" mentions outside any A-block should be found by the RegExp.«

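Since the OP's example clearly shows that the OP wants to match only the node values (or text contents) of first-level text nodes, a solution based on DOMParser.parseFromString could look similar to the following example code ...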

const sampleMarkup =
  `Deep Work 1
  <p>
    <a data-href="Deep Work" href="Deep Work" class="internal-link" 
    target="_blank" rel="noopener">Deep Work</a>
  </p>
  Deep Work 2
  <a href="blabla">Some other text</a>`;

console.log(
  Array
    // make an array from ...
    .from(
      (new DOMParser)
        // ... a parsed document ...
        .parseFromString(sampleMarkup, 'text/html')
        // ... body's ...
        .querySelector('body')
        // ... child nodes ...
        .childNodes
    )
    // ... and filter just the first level text nodes in order to ...
    .filter(node => node.nodeType === 3)
    // ... retrieve each matching text node's sanitized/trimmed text content.
    .map(node => node.nodeValue.trim())
);


From the above comments ...

»The last edit, which provides the one-liner markup, implicitly changes the requirements for it is not equal to the before provided formatted html code. There is a difference in matching just all first level text node values (formatted code example) and matching any text node value which is not part of an <a/> element (the one-liner markup).«

... but, as also stated before ...

»The task described by the OP is nothing that should be solved by regex (nor can a pure regex based approach assure 100% reliability on that matter). The OP should consider a DOMParser based approach.«

... thanks to this reliable approach, the refactoring can be achieved easily ...

// `<p>
//   <a data-href="Deep Work" href="Deep Work" class="internal-link" target="_blank" rel="noopener">
//     Deep Work
//   </a>
//   Deep Work 1
//   <a data-href="Deep Work" href="Deep Work" class="internal-link" target="_blank" rel="noopener">
//     Deep Work
//   </a>
//   Deep Work 2
// </p>`

const sampleMarkup = '<p><a data-href="Deep Work" href="Deep Work" class="internal-link" target="_blank" rel="noopener">Deep Work</a>Deep Work 1<a data-href="Deep Work" href="Deep Work" class="internal-link" target="_blank" rel="noopener">Deep Work</a>Deep Work 2</p>';

function collectTextNodes(textNodeList, node) {
  const nodeType = node?.nodeType;
  if (nodeType === 1) {

    [...node.childNodes]
      .reduce(collectTextNodes, textNodeList);

  } else if (nodeType === 3) {

    textNodeList.push(node);
  }
  return textNodeList;
}

console.log(
  // ... collect any text node from within ...
  collectTextNodes(
    [],
    (new DOMParser)
      // ... a parsed document's ...
      .parseFromString(sampleMarkup, 'text/html')
      // ... body ...
      .querySelector('body')
  )
  // ... and filter any text node which is not located within an `<a/>` element ...
  .filter(textNode =>
    textNode.parentNode.closest('a') === null
  )
  // ... and retrieve each matching text node's sanitized/trimmed text content.
  .map(node =>
    node.nodeValue.trim()
  )
);
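As a possible variation on the collect-then-filter code above, the traversal could skip every `<a>` subtree while walking, so no text node inside a link is ever collected. This is only a sketch: `collectTextOutsideLinks` and `onelineMarkup` are hypothetical names, not part of the original answer, and the function relies only on the standard `nodeType`, `nodeName`, `nodeValue` and `childNodes` properties. Unlike the code above, it also drops whitespace-only text nodes.

```javascript
// Hypothetical helper (assumption, not from the original answer):
// walks the tree iteratively and never descends into <a> elements.
function collectTextOutsideLinks(root) {
  const ELEMENT_NODE = 1;
  const TEXT_NODE = 3;
  const texts = [];
  const stack = [root];

  while (stack.length > 0) {
    const node = stack.pop();

    if (node.nodeType === TEXT_NODE) {
      // keep only text nodes that contain more than whitespace
      const value = node.nodeValue.trim();
      if (value !== '') {
        texts.push(value);
      }
    } else if (
      node.nodeType === ELEMENT_NODE &&
      node.nodeName.toUpperCase() !== 'A'
    ) {
      // push children in reverse so they are popped in document order
      const children = Array.from(node.childNodes);
      for (let i = children.length - 1; i >= 0; i--) {
        stack.push(children[i]);
      }
    }
    // any other node (an <a> element, a comment, ...) is skipped entirely
  }
  return texts;
}

// Browser-only usage; DOMParser is not available in plain Node.js.
if (typeof DOMParser !== 'undefined') {
  const onelineMarkup =
    '<p><a href="Deep Work">Deep Work</a>Deep Work 1' +
    '<a href="Deep Work">Deep Work</a>Deep Work 2</p>';
  const doc = new DOMParser().parseFromString(onelineMarkup, 'text/html');
  console.log(collectTextOutsideLinks(doc.body));
}
```

Pruning during the walk avoids the later `closest('a')` lookup for each text node, at the cost of hard-coding the "not inside a link" rule into the traversal itself.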

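For readers who still need a regex-only route (for example, where no DOM parser is available), one common trick is to let the pattern consume what should be ignored and capture only what is wanted. The following is a minimal sketch under the assumption of well-formed, non-nested `<a>...</a>` markup. `findOutsideLinks` and `questionMarkup` are hypothetical names, and the markup is a shortened version of the OP's one-liner. As the comment quoted above says, such an approach cannot be made fully reliable.

```javascript
// Hypothetical helper (assumption, not from the original answer).
// Returns the start index of every occurrence of `phrase` that the
// pattern does not consider part of an <a> element or of a tag.
function findOutsideLinks(markup, phrase) {
  // escape regex metacharacters so the phrase is matched literally
  const escaped = phrase.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

  const pattern = new RegExp(
    // 1) a whole <a ...>...</a> element, 2) any other tag, 3) the phrase
    `<a\\b[^>]*>[\\s\\S]*?<\\/a>|<[^>]*>|(${escaped})`,
    'gi' // note: case-insensitive for both the tags and the phrase
  );

  const hits = [];
  for (const match of markup.matchAll(pattern)) {
    // only alternative 3 fills the capture group
    if (match[1] !== undefined) {
      hits.push(match.index);
    }
  }
  return hits;
}

// Shortened version of the OP's one-liner markup.
const questionMarkup =
  '<p><a href="Deep Work">Deep Work</a> Deep Work ' +
  '<a href="Deep Work">Deep Work</a> Deep Work </p>';

console.log(findOutsideLinks(questionMarkup, 'Deep Work').length); // 2

// Limitation: with an unclosed link such as '<a href="x">Deep Work',
// alternative 1 cannot match, so the text is reported as a hit even
// though an HTML parser treats it as content of the <a> element.
```

The unclosed-link case is one concrete reason to prefer the DOMParser based code whenever a DOM is available.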

Tags: javascript, arrays, recursion, filter, domparser