Is this a secure way to convert HTML to text?
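I receive responses from a semi-untrusted API that are supposed to contain HTML.
I want to convert them to plain text, basically stripping all formatting, so that I can easily search the text and then display (parts of) it.
I came up with this: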
function convertHtmlToText(html) {
  const div = document.createElement("div");
  // assumption: because the div is not part of the document
  // - no scripts are executed
  // - no layout pass
  div.innerHTML = html;
  // assumption: whitespace is still normalized
  // assumption: this returns the text a user would see, if the element was inserted into the DOM.
  // Minus the stuff that would depend on stylesheets anyway.
  return div.innerText;
}
const html = `
Some random untrusted string that is supposed to contain html.
Presumably some 'rich text'.
A few <div> or <p>, a link or two, a bit of <strong> and some such.
In any case not a complete html document.
`;
const text = convertHtmlToText(html);
const p = document.createElement("p");
p.textContent = text;
document.body.append(p);
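I think this is safe/secure, because as long as the div used for the conversion
is not inserted into the document, scripts won't be executed.
Question: is this safe/secure?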
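No, this is not safe at all. Event handler attributes such as onerror still run even though the div is never inserted into the document: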
function convertHtmlToText(html) {
  const div = document.createElement("div");
  // assumption: because the div is not part of the document
  // - no scripts are executed
  // - no layout pass
  div.innerHTML = html;
  // assumption: whitespace is still normalized
  // assumption: this returns the text a user would see, if the element was inserted into the DOM.
  // Minus the stuff that would depend on stylesheets anyway.
  return div.innerText;
}
const html = `<img onerror="alert('Gotcha!')" src="">Hi`;
const text = convertHtmlToText(html);
const p = document.createElement("p");
p.textContent = text;
document.body.append(p);
If you really only need the text content, prefer DOMParser, which won't execute any scripts:
function convertHtmlToText(html) {
  const doc = new DOMParser().parseFromString(html, 'text/html');
  return doc.body.innerText;
}
const html = `<img onerror="alert('Gotcha!')" src="">Hi`;
const text = convertHtmlToText(html);
const p = document.createElement("p");
p.textContent = text;
document.body.append(p);
But beware that these methods also capture the text content of nodes the user normally wouldn't see (e.g. <style> or <script>).
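One possible way around that caveat is to walk the parsed tree yourself and skip the elements whose text is not normally shown. The following is only a sketch, not part of the original answer: the names collectVisibleText, convertHtmlToVisibleText, and SKIPPED_ELEMENTS are hypothetical, and the list of skipped tags is an assumption you may want to adjust. It still ignores anything hidden through CSS.

```javascript
// Hypothetical helpers; the set of skipped tags is an assumption, not a complete list.
const SKIPPED_ELEMENTS = new Set(["SCRIPT", "STYLE", "TEMPLATE", "NOSCRIPT"]);
const ELEMENT_NODE = 1; // same value as Node.ELEMENT_NODE
const TEXT_NODE = 3;    // same value as Node.TEXT_NODE

// Recursively collect the values of text nodes, skipping the subtrees
// of elements listed in SKIPPED_ELEMENTS. Comment nodes are ignored.
function collectVisibleText(node, parts = []) {
  if (node.nodeType === TEXT_NODE) {
    parts.push(node.nodeValue);
  } else if (node.nodeType === ELEMENT_NODE && !SKIPPED_ELEMENTS.has(node.nodeName)) {
    for (const child of node.childNodes) {
      collectVisibleText(child, parts);
    }
  }
  return parts;
}

// Browser-only wrapper: parse with DOMParser (which does not run scripts),
// then join the collected text and collapse runs of whitespace.
function convertHtmlToVisibleText(html) {
  const doc = new DOMParser().parseFromString(html, "text/html");
  return collectVisibleText(doc.body).join("").replace(/\s+/g, " ").trim();
}
```

In a browser, convertHtmlToVisibleText(html) can be used in place of convertHtmlToText(html) from the snippet above; the result should still be written to the page via textContent, never innerHTML.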