Is this a secure way to convert HTML to text?

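I receive responses from a semi-untrusted API that are supposed to contain HTML. I want to convert them to plain text, essentially stripping all formatting, so that I can easily search the text and then display (parts of) it.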

I came up with this:

function convertHtmlToText(html) {
    const div = document.createElement("div");
    // assumption: because the div is not part of the document
    // - no scripts are executed
    // - no layout pass
    div.innerHTML = html; 
    // assumption: whitespace is still normalized
    // assumption: this returns the text a user would see, if the element was inserted into the DOM.
    //             Minus the stuff that would depend on stylesheets anyway.
    return div.innerText; 
}

const html = `
    Some random untrusted string that is supposed to contain html. 
    Presumably some 'rich text'. 
    A few <div> or <p>, a link or two, a bit of <strong> and some such. 
    In any case not a complete html document.
`;

const text = convertHtmlToText(html);

const p = document.createElement("p");
p.textContent = text;
document.body.append(p);

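I believe this is safe/secure because scripts will not execute as long as the div used for the conversion is never inserted into the document.

Question: Is this safe/secure?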

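No, this is not safe at all. Inline event handlers still run on elements that are not attached to the document. For example, an <img> created through innerHTML starts loading right away, so its onerror handler fires: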

function convertHtmlToText(html) {
    const div = document.createElement("div");
    // assumption: because the div is not part of the document
    // - no scripts are executed
    // - no layout pass
    div.innerHTML = html; 
    // assumption: whitespace is still normalized
    // assumption: this returns the text a user would see, if the element was inserted into the DOM.
    //             Minus the stuff that would depend on stylesheets anyway.
    return div.innerText; 
}

const html = `<img onerror="alert('Gotcha!')" src="">Hi`;

const text = convertHtmlToText(html);

const p = document.createElement("p");
p.textContent = text;
document.body.append(p);

If you really only need the text content, prefer a DOMParser, which will not execute any scripts:

function convertHtmlToText(html) {
  const doc = new DOMParser().parseFromString(html, 'text/html');
  return doc.body.innerText;
}

const html = `<img onerror="alert('Gotcha!')" src="">Hi`;

const text = convertHtmlToText(html);

const p = document.createElement("p");
p.textContent = text;
document.body.append(p);

But beware that both approaches also capture the text content of nodes the user would not normally see (e.g. <style> or <script>).
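
One way to deal with that caveat is to drop those nodes from the parsed document before reading the text. The following is only a sketch: the function name `convertHtmlToVisibleText` and the selector list are my own assumptions, and you may want to remove other elements too. It also reads `textContent` rather than `innerText`, on the understanding that `innerText` falls back to the same value when the element is not being rendered, so whitespace is not normalized either way.

```javascript
// Sketch only: assumes a browser environment where DOMParser is available.
// The name convertHtmlToVisibleText is hypothetical, not part of any API.
function convertHtmlToVisibleText(html) {
  // The parsed document is inert: scripts and inline handlers do not run.
  const doc = new DOMParser().parseFromString(html, "text/html");

  // Assumed list of elements whose text a user would not normally see.
  // Query the whole document, since leading <style>/<script> end up in <head>.
  doc.querySelectorAll("script, style").forEach((el) => el.remove());

  // Nothing here is rendered, so no layout-based whitespace handling applies.
  return doc.body.textContent;
}
```

You would call it the same way as before, e.g. `p.textContent = convertHtmlToVisibleText(html);`, and normalize whitespace yourself if needed (for instance with `text.replace(/\s+/g, " ").trim()`).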