Security concern over method of removing tags from html

Question

I am using findAndReplaceDOMText, a library that lets you wrap text that spans multiple tags.

Consider wrapping "o b" in <em> tags in the following html:

<p>foo <span>bar</span></p>

It produces the following:

<p>fo<em>o </em><span><em>b</em>ar</span></p>

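This works well. My concern is that my strategy for removing these tags may open up the possibility of code injection. The code below works; I am just worried about potential code injection opportunities, especially because I am developing a chrome extension, so the target page's HTML may be malformed.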

import $ from 'jquery'

export default function clearMarks() {
  $(".deepSearch-highlight").parent().each(function() {
    const contents = []
    const $parent = $(this)
    $parent.contents().each(function() {
      const $node = $(this)
      let html

      if ($node.hasClass("deepSearch-highlight")) {
        html = $node.html()
      }
      else if (this.nodeName === "#text") {
        html = this.data
      }
      else {
        html = this.outerHTML
      }
      contents.push(html)
    })
    $parent.html(contents.join(""))
  })
}

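My goal is to restore the html to exactly the state it was in before it was transformed with findAndReplaceDOMText. In the "Additional information" section you can see how a simpler clearMarks function changes the number of text nodes.

Does my strategy have any security holes that I am missing? Is there a more secure/more elegant/generally better way to achieve my goal?

Additional information:

I am using the findAndReplaceDOMText option preset: "prose", which will: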

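Ignore non-textual elements (e.g. <script>, <svg>, <optgroup>, etc.)
As an aside, the much simpler $(this).replaceWith($(this).html()) results in an explosion in the number of text nodes. For the example above we would get: <p>"fo""o "<span>"b""ar"</span></p> (where text nodes are denoted by "). This causes problems if you try to re-apply findAndReplaceDOMText, in addition to being generally smelly.
The inserted span elements have the class .deepSearch-highlight (as opposed to the example above, which wraps the text in em). See the full code below.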


import $ from "jquery"
import findAndReplaceDomText from "findandreplacedomtext"

import buildRegex from "../../shared/buildRegex"
import scrollToElement from "./scrollToElement"


export default function search(queryParams) {
  const regex = buildRegex(queryParams)
  findAndReplaceDomText($('body')[0], {
    find: regex,
    replace: createHighlight,
    preset: "prose",
    filterElements,
  })
  scrollToElement($(".deepSearch-current-highlight"))
}

function createHighlight(portion, match) {
  var wrapped = document.createElement("span")
  var wrappedClasses = "deepSearch-highlight"
  if (match.index === 0) {
    wrappedClasses += " deepSearch-current-highlight"
  }
  wrapped.setAttribute("class", wrappedClasses)
  wrapped.setAttribute("data-highlight-index", match.index)
  wrapped.appendChild(document.createTextNode(portion.text))
  return wrapped
}

function filterElements(elem) {
  const $elem = $(elem)
  return $elem.is(":visible") && !$elem.attr("aria-hidden")
}

Answer 1

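If you only want to remove elements and keep their child text, don't deal with HTML. You should use plain DOM APIs to move the text and element nodes. Going through an HTML parser gives you suboptimal performance at best and a security hole at worst.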
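To make the "security hole" case concrete, here is a small sketch of what can go wrong when text is round-tripped through an HTML string. It does not touch a real DOM: the `children` array and the `rebuildHtml` helper are hypothetical stand-ins that only mirror the branch logic of clearMarks from the question. The point it illustrates is that a text node's `.data` is already decoded, so text the page had safely escaped comes back as raw markup.

```javascript
// Hypothetical stand-ins for the child nodes of this (harmless) page source:
// <p>fo<span class="deepSearch-highlight">o</span> &lt;img src=x onerror=alert(1)&gt;</p>
const children = [
  { nodeName: '#text', data: 'fo' },
  { nodeName: 'SPAN', className: 'deepSearch-highlight', innerHTML: 'o' },
  // The parser decoded &lt; and &gt;, so .data holds real angle brackets
  { nodeName: '#text', data: ' <img src=x onerror=alert(1)>' },
]

// Mirrors the three branches of clearMarks (assumption: simplified to plain objects)
function rebuildHtml(nodes) {
  return nodes.map(node => {
    if (node.className === 'deepSearch-highlight') return node.innerHTML
    if (node.nodeName === '#text') return node.data // decoded text, not re-escaped
    return node.outerHTML
  }).join('')
}

const html = rebuildHtml(children)
// html === 'foo <img src=x onerror=alert(1)>'
// Handing this string to $parent.html(...) would parse the <img> as an element
// instead of showing it as text.
```

Moving the existing nodes instead of serializing them avoids this class of problem, because nothing is ever re-parsed.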

As an aside, the much simpler $(this).replaceWith($(this).html()) results in an explosion in the number of text nodes.

This can be solved by calling Node.normalize() on an ancestor.
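Putting both suggestions together, a minimal sketch might look like the following. The function name `unwrapHighlights` and its parameters are assumptions made for illustration; only standard DOM methods (`querySelectorAll`, `insertBefore`, `removeChild`, `normalize`) are used, and no HTML string is ever built or parsed.

```javascript
// Remove every wrapper matching `selector` under `root`, keeping its children,
// then merge the text nodes that the wrapping had split apart.
// (Hypothetical helper; name and signature are not from the question.)
function unwrapHighlights(root, selector) {
  const parents = new Set()

  for (const mark of root.querySelectorAll(selector)) {
    const parent = mark.parentNode
    if (!parent) continue

    // Move each child (normally a single text node) in front of the wrapper
    while (mark.firstChild) {
      parent.insertBefore(mark.firstChild, mark)
    }
    parent.removeChild(mark)
    parents.add(parent)
  }

  // normalize() joins adjacent text nodes and drops empty ones
  for (const parent of parents) {
    parent.normalize()
  }
}
```

In the extension this would be called as `unwrapHighlights(document.body, ".deepSearch-highlight")`. One caveat: `normalize()` merges all adjacent text nodes below the node it is called on, so if the page already contained split text nodes before highlighting, those are merged too.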
