Is this a secure way to convert HTML to text?

Question

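I receive responses from a semi-untrusted API that are supposed to contain HTML. I want to convert them to plain text, essentially stripping all formatting, so that I can easily search the text and then display (parts of) it.

I came up with this: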

function convertHtmlToText(html) {
    const div = document.createElement("div");
    // assumption: because the div is not part of the document
    // - no scripts are executed
    // - no layout pass
    div.innerHTML = html; 
    // assumption: whitespace is still normalized
    // assumption: this returns the text a user would see, if the element was inserted into the DOM.
    //             Minus the stuff that would depend on stylesheets anyway.
    return div.innerText; 
}

const html = `
    Some random untrusted string that is supposed to contain html. 
    Presumably some 'rich text'. 
    A few <div> or <p>, a link or two, a bit of <strong> and some such. 
    In any case not a complete html document.
`;

const text = convertHtmlToText(html);

const p = document.createElement("p");
p.textContent = text;
document.body.append(p);

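I believe this is safe/secure because scripts are not executed as long as the div used for the conversion is never inserted into the document.

Question: is this safe/secure?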

Answer 1

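No, this is not safe at all. Inline event handlers still run on a detached element: the <img> below starts loading as soon as innerHTML is assigned, the load fails, and its onerror handler fires.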

function convertHtmlToText(html) {
    const div = document.createElement("div");
    // assumption: because the div is not part of the document
    // - no scripts are executed
    // - no layout pass
    div.innerHTML = html; 
    // assumption: whitespace is still normalized
    // assumption: this returns the text a user would see, if the element was inserted into the DOM.
    //             Minus the stuff that would depend on stylesheets anyway.
    return div.innerText; 
}

const html = `<img onerror="alert('Gotcha!')" src="">Hi`;

const text = convertHtmlToText(html);

const p = document.createElement("p");
p.textContent = text;
document.body.append(p);

If you really only need the text content, prefer a DOMParser, which will not execute any scripts:

function convertHtmlToText(html) {
  const doc = new DOMParser().parseFromString(html, 'text/html');
  return doc.body.innerText;
}

const html = `<img onerror="alert('Gotcha!')" src="">Hi`;

const text = convertHtmlToText(html);

const p = document.createElement("p");
p.textContent = text;
document.body.append(p);

But beware that these methods also capture the text content of nodes your users would normally not see (e.g. <style> or <script>).
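
One possible way to deal with that caveat is to drop those elements from the parsed document before reading its text. This is only a minimal sketch: the function name htmlToVisibleText and the selector "script, style" are my own choices, not something from the question, and you may want to extend the selector for other elements whose text should not be shown.

```javascript
// Sketch only: htmlToVisibleText is a hypothetical helper name.
// Assumes a browser environment where DOMParser is available.
function htmlToVisibleText(html) {
  // Parsing through DOMParser produces an inert document: no scripts run.
  const doc = new DOMParser().parseFromString(html, "text/html");

  // Remove elements whose text a user would normally never see.
  // The selector is an assumption; adjust it to your needs.
  for (const el of doc.querySelectorAll("script, style")) {
    el.remove();
  }

  // The document is never rendered, so read the raw text of what is left.
  return doc.body.textContent;
}

// Usage (in a browser):
// htmlToVisibleText("<p>Hello</p><script>alert(1)</script>");
```

The result is still just a string, so keep assigning it through textContent (as in the question) rather than innerHTML when you display it.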

javascript

xss