Clean email HTML body with too many tags

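I want to clean up a lot of rather messy email HTML bodies (taken from emails sent with Gmail): lots of nested <div>, unneeded font changes, etc. I want to clean this up and keep only <a>, <b>, <br>, <i>, <img> and nothing else (possibly also <p> or a few <div>, if and only if that is really necessary).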

Using /<\/?(?!(a|br|b|img)\b)\w+[^>]*>/g works most of the time:

document.onclick = function() {
    document.body.innerHTML = document.body.innerHTML.replace(/<\/?(?!(a|br|b|img)\b)\w+[^>]*>/g, '');
}
<div dir="ltr"><div class="gmail_quote"><div dir="ltr">Hello,<div><br></div><div><div><div style="font-size:12.8px"><span style="font-size:12.8px">Thank you for your message.</span><br></div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px"><span style="font-size:12.8px">If the L<span class="m_-527331299899979m_70391001927gmail-il">orem</span>i</span><span class="m_-527331299899979m_703910001927gmail-m_2466414472930393055gmail-il" style="font-size:12.8px">psum</span><span style="font-size:12.8px"> bla bla </span><a href="http://example.com" style="font-size:12.8px" target="_blank">test</a><span style="font-size:12.8px"> window, then it will be like this.</span><br></div><div style="font-size:12.8px">Blah blah.</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">Lorem ipsum<span style="font-size:12.8px">lorem ipsum </span><span style="font-size:12.8px">blah blah and</span><span style="font-size:12.8px"> you can </span><span style="font-size:12.8px">also <i>blah blah</i> and finally <i>Blah</i>.</span></div><div style="font-size:12.8px"><span style="font-size:12.8px"><br></span></div><div style="font-size:12.8px"><span style="font-size:12.8px">-----------</span></div><div style="font-size:12.8px"><span style="font-size:12.8px"><br></span></div><div style="font-size:12.8px"><span style="font-size:12.8px">Examples:</span></div><div style="font-size:12.8px"><span style="font-size:12.8px"><br></span></div><div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">test1</a></span></div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">test2</a></span></div><div><br></div><div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">test3</a></span></div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">test4</a></span></div></div><div><br></div><div><span 
style="font-size:12.8px">example: <a href="http://example.com" target="_blank">test4</a></span></div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">test5</a></span></div><div><span style="font-size:12.8px"><br></span></div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">example</a></span></div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">ex<wbr>ample</a></span></div><div><span style="font-size:12.8px"><br></span></div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">example</a></span></div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">exam<wbr>ple</a></span></div><div><span style="font-size:12.8px"><br></span></div><div><br></div></div></div><div class="gmail_extra" style="font-size:12.8px"><div class="m_-52733129979m_703911927gmail-m_24664144055gmail_signature"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><span style="font-size:small">Sincerly,</span><br></div></div></div></div></div></div></div></div><div><div><div class="m_-52722719979m_7039100982345401927gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><br></div><div>Myself<br></div><div dir="ltr"><br><b>example</b><br>web: <a href="http://www.example.com" target="_blank">www.example.com</a><br></div><div>fb: <a href="http://www.facebook.com/example/" target="_blank">www.facebook.com/LoremIp<wbr>sum/</a><br></div><div>mail: <a href="mailto:contact@example.com" target="_blank">contact@example.com</a><br></div><div dir="ltr"><br><img src="http://example.com/example.png"><br></div></div></div></div></div></div></div></div></div></div></div></div></div></div><br></div>

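(After running the snippet, click anywhere in the email to see what happens once the regex is applied.)

Indeed:

  • useless <span></span> tags are successfully removed
  • <div fontstyle="..."></div> tags are removed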
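To see which tags that pattern keeps, here is a small sketch that runs it on plain strings instead of document.body.innerHTML. The helper name stripTags and the sample strings are made up for illustration; the pattern is built from a list of tag names but is otherwise the one above. The \b inside the lookahead is what stops b from also whitelisting <blockquote>, and because i is not in the list, <i> tags get stripped as well.

```javascript
// Hypothetical helper: remove every tag whose name is not in `keep`.
// Builds the same pattern as /<\/?(?!(a|br|b|img)\b)\w+[^>]*>/g.
function stripTags(html, keep) {
    const pattern = new RegExp('<\\/?(?!(' + keep.join('|') + ')\\b)\\w+[^>]*>', 'g');
    return html.replace(pattern, '');
}

const questionList = ['a', 'br', 'b', 'img'];

// div, span and blockquote are removed; b and br survive.
console.log(stripTags(
    '<div style="font-size:12.8px"><span>Hi <b>there</b></span><blockquote>q</blockquote><br></div>',
    questionList
));
// Hi <b>there</b>q<br>

// <i> is not in the question's list, so it is removed too.
console.log(stripTags('also <i>blah blah</i>', questionList));
// also blah blah

// Adding 'i' keeps it, and <img> is still matched as img rather than as i.
console.log(stripTags('<i>x</i><img src="a.png">', ['a', 'br', 'b', 'i', 'img']));
// <i>x</i><img src="a.png">
```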

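But removing the <div> tags like this still causes problems:

  • blank lines are removed (see the blank line between lines 1 and 3 of the email output, between lines 3 and 5, etc.)

  • the line break after each example: test1 is removed (see the snippet)

I tried replacing <div.*?><br></div> with <br><br>, but the result is still not right.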
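Both symptoms can be reproduced on tiny inputs. The sketch below uses made-up sample strings; it also shows one likely reason why the <div.*?><br></div> attempt misbehaves (this is an assumption about that attempt, not something verified against the full email): .*? is allowed to run past the end of the opening tag, so a match can start at an outer <div> and swallow the text in between.

```javascript
// Same whitelist pattern as in the question.
const strip = html => html.replace(/<\/?(?!(a|br|b|img)\b)\w+[^>]*>/g, '');

// Each <div> renders on its own line, so deleting the tags deletes the breaks.
const lostBreak = strip('<div>example: test1</div><div>example: test2</div>');
console.log(lostBreak); // example: test1example: test2

// An "empty line" div leaves a single <br>: a line break, but no blank line.
const lostBlank = strip('<div>line 1</div><div><br></div><div>line 3</div>');
console.log(lostBlank); // line 1<br>line 3

// `.*?` may cross tag boundaries: the match starts at the outer <div>
// and takes "Hello," with it.
const crossTag = '<div dir="ltr">Hello,<div><br></div>'
    .replace(/<div.*?><br><\/div>/g, '<br><br>');
console.log(crossTag); // <br><br>

// `[^>]*` cannot leave the opening tag, so only the inner div is replaced.
const scoped = '<div dir="ltr">Hello,<div><br></div>'
    .replace(/<div[^>]*><br><\/div>/g, '<br><br>');
console.log(scoped); // <div dir="ltr">Hello,<br><br>
```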

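Question: how can I clean this HTML, discarding the unneeded font changes etc., while preserving the same blank lines and keeping the <a>, <b>, <br>, <i>, <img> tags?

Note: this ultimately has to run in Google Apps Script, so I'm not sure whether a third-party JS library can be imported...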

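The following 5-step process works for the sample you provided:

  1. In a first pass, keep the div tags, but remove all other unwanted tags.
  2. Replace <div><br></div> with <br><br>.
  3. Replace any sequence of one or more closing </div> tags, each possibly preceded by a <br>, with a single <br>.
  4. Remove all remaining div tags.
  5. Replace any sequence of two or more <br> tags with two <br> tags.

Code: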

document.onclick = function() {
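    document.body.innerHTML = document.body.innerHTML
                              // 1. drop every tag except a, br, b, i, img and (for now) div
                              .replace(/<\/?(?!(a|br|b|i|img|div)\b)\w+[^>]*>/g, '')
                              // 2. a div holding only <br> is a blank line
                              .replace(/<div[^>]*><br><\/div>/g, '<br><br>')
                              // 3. a run of closing divs (each optionally preceded by <br>) is one break
                              .replace(/((<br>)?<\/div>)+/g, '<br>')
                              // 4. the opening div tags are no longer needed
                              .replace(/<div[^>]*>/g, '')
                              // 5. never keep more than one blank line in a row
                              .replace(/(<br>){2,}/g, '<br><br>');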
}
<div dir="ltr"><div class="gmail_quote"><div dir="ltr">Hello,<div><br></div><div><div><div style="font-size:12.8px"><span style="font-size:12.8px">Thank you for your message.</span><br></div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px"><span style="font-size:12.8px">If the L<span class="m_-527331299899979m_70391001927gmail-il">orem</span>i</span><span class="m_-527331299899979m_703910001927gmail-m_2466414472930393055gmail-il" style="font-size:12.8px">psum</span><span style="font-size:12.8px"> bla bla </span><a href="http://example.com" style="font-size:12.8px" target="_blank">test</a><span style="font-size:12.8px"> window, then it will be like this.</span><br></div><div style="font-size:12.8px">Blah blah.</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">Lorem ipsum<span style="font-size:12.8px">lorem ipsum </span><span style="font-size:12.8px">blah blah and</span><span style="font-size:12.8px"> you can </span><span style="font-size:12.8px">also <i>blah blah</i> and finally <i>Blah</i>.</span></div><div style="font-size:12.8px"><span style="font-size:12.8px"><br></span></div><div style="font-size:12.8px"><span style="font-size:12.8px">-----------</span></div><div style="font-size:12.8px"><span style="font-size:12.8px"><br></span></div><div style="font-size:12.8px"><span style="font-size:12.8px">Examples:</span></div><div style="font-size:12.8px"><span style="font-size:12.8px"><br></span></div><div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">test1</a></span></div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">test2</a></span></div><div><br></div><div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">test3</a></span></div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">test4</a></span></div></div><div><br></div><div><span 
style="font-size:12.8px">example: <a href="http://example.com" target="_blank">test4</a></span></div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">test5</a></span></div><div><span style="font-size:12.8px"><br></span></div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">example</a></span></div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">ex<wbr>ample</a></span></div><div><span style="font-size:12.8px"><br></span></div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">example</a></span></div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">exam<wbr>ple</a></span></div><div><span style="font-size:12.8px"><br></span></div><div><br></div></div></div><div class="gmail_extra" style="font-size:12.8px"><div class="m_-52733129979m_703911927gmail-m_24664144055gmail_signature"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><span style="font-size:small">Sincerly,</span><br></div></div></div></div></div></div></div></div><div><div><div class="m_-52722719979m_7039100982345401927gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><br></div><div>Myself<br></div><div dir="ltr"><br><b>example</b><br>web: <a href="http://www.example.com" target="_blank">www.example.com</a><br></div><div>fb: <a href="http://www.facebook.com/example/" target="_blank">www.facebook.com/LoremIp<wbr>sum/</a><br></div><div>mail: <a href="mailto:contact@example.com" target="_blank">contact@example.com</a><br></div><div dir="ltr"><br><img src="http://example.com/example.png"><br></div></div></div></div></div></div></div></div></div></div></div></div></div></div><br></div>
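To follow what each of the five passes contributes, here is a sketch that applies the same regexes to a plain string and records the result after every pass. It needs no document object, which may matter for the Apps Script requirement. STEPS, traceCleanup and the short input are illustrative names and data, not part of the snippet above.

```javascript
// The five passes from the answer, as [pattern, replacement] pairs.
const STEPS = [
    [/<\/?(?!(a|br|b|i|img|div)\b)\w+[^>]*>/g, ''],
    [/<div[^>]*><br><\/div>/g, '<br><br>'],
    [/((<br>)?<\/div>)+/g, '<br>'],
    [/<div[^>]*>/g, ''],
    [/(<br>){2,}/g, '<br><br>'],
];

// Hypothetical helper: returns the intermediate string after each pass.
function traceCleanup(html) {
    const states = [];
    let current = html;
    for (const [pattern, replacement] of STEPS) {
        current = current.replace(pattern, replacement);
        states.push(current);
    }
    return states;
}

const input = '<div dir="ltr">Hello,<div><br></div>' +
    '<div><span style="font-size:12.8px">Thank you.</span><br></div>' +
    '<div>Blah <i>blah</i>.</div></div>';

traceCleanup(input).forEach((state, i) => console.log(`step ${i + 1}: ${state}`));
// step 1: <div dir="ltr">Hello,<div><br></div><div>Thank you.<br></div><div>Blah <i>blah</i>.</div></div>
// step 2: <div dir="ltr">Hello,<br><br><div>Thank you.<br></div><div>Blah <i>blah</i>.</div></div>
// step 3: <div dir="ltr">Hello,<br><br><div>Thank you.<br><div>Blah <i>blah</i>.<br>
// step 4: Hello,<br><br>Thank you.<br>Blah <i>blah</i>.<br>
// step 5: Hello,<br><br>Thank you.<br>Blah <i>blah</i>.<br>
```

Step 2 has to run before step 3: otherwise the <br></div> inside an empty-line div would already be collapsed to a single <br> and the blank line would be lost.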

This is what I ended up using (it works for all emails sent via Gmail); 99.99% of the credit goes to @Michelle's accepted answer:

document.onclick = function() {
    document.body.innerHTML = document.body.innerHTML.replace(/<\/?(?!(a|br|b|i|img|div)\b)\w+[^>]*>/g, '')
             .replace(/<div[^>]*><br[^>]*>/g, '<br><br>')
             .replace(/((<br>)?<\/div>)+/g, '<br>')
             .replace(/<div[^>]*>/g, '')
             .replace(/(<br>){2,}/g, '<br><br>')
             .replace(/ style="font-size.*?"/g, ''); 
}
<div dir="ltr"><div class="gmail_quote"><div dir="ltr">Hello,<div><br></div><div><div><div style="font-size:12.8px"><span style="font-size:12.8px">Thank you for your message.</span><br></div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px"><span style="font-size:12.8px">If the L<span class="m_-527331299899979m_70391001927gmail-il">orem</span>i</span><span class="m_-527331299899979m_703910001927gmail-m_2466414472930393055gmail-il" style="font-size:12.8px">psum</span><span style="font-size:12.8px"> bla bla </span><a href="http://example.com" style="font-size:12.8px" target="_blank">test</a><span style="font-size:12.8px"> window, then it will be like this.</span><br></div><div style="font-size:12.8px">Blah blah.</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">Lorem ipsum<span style="font-size:12.8px">lorem ipsum </span><span style="font-size:12.8px">blah blah and</span><span style="font-size:12.8px"> you can </span><span style="font-size:12.8px">also <i>blah blah</i> and finally <i>Blah</i>.</span></div><div style="font-size:12.8px"><span style="font-size:12.8px"><br></span></div><div style="font-size:12.8px"><span style="font-size:12.8px">-----------</span></div><div style="font-size:12.8px"><span style="font-size:12.8px"><br></span></div><div style="font-size:12.8px"><span style="font-size:12.8px">Examples:</span></div><div style="font-size:12.8px"><span style="font-size:12.8px"><br></span></div><div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">test1</a></span></div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">test2</a></span></div><div><br></div><div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">test3</a></span></div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">test4</a></span></div></div><div><br></div><div><span 
style="font-size:12.8px">example: <a href="http://example.com" target="_blank">test4</a></span></div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">test5</a></span></div><div><span style="font-size:12.8px"><br></span></div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">example</a></span></div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">ex<wbr>ample</a></span></div><div><span style="font-size:12.8px"><br></span></div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">example</a></span></div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">exam<wbr>ple</a></span></div><div><span style="font-size:12.8px"><br></span></div><div><br></div></div></div><div class="gmail_extra" style="font-size:12.8px"><div class="m_-52733129979m_703911927gmail-m_24664144055gmail_signature"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><span style="font-size:small">Sincerly,</span><br></div></div></div></div></div></div></div></div><div><div><div class="m_-52722719979m_7039100982345401927gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><br></div><div>Myself<br></div><div dir="ltr"><br><b>example</b><br>web: <a href="http://www.example.com" target="_blank">www.example.com</a><br></div><div>fb: <a href="http://www.facebook.com/example/" target="_blank">www.facebook.com/LoremIp<wbr>sum/</a><br></div><div>mail: <a href="mailto:contact@example.com" target="_blank">contact@example.com</a><br></div><div dir="ltr"><br><img src="http://example.com/example.png"><br></div></div></div></div></div></div></div></div></div></div></div></div></div></div><br></div>
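Since the final target is Google Apps Script, where there is no document object, the same chain can be wrapped in a function that takes and returns a string. This is only a sketch: the name cleanGmailHtml and the sample input are made up, and the <br clear="all"> in the sample is an assumed example of a <br> carrying attributes. Compared with the accepted answer, the second pass here accepts attributes on <br> and does not require the closing </div>, and a last pass strips leftover font-size styles from the tags that are kept.

```javascript
// Hypothetical wrapper around the regex chain above; works on plain strings.
function cleanGmailHtml(html) {
    return html
        .replace(/<\/?(?!(a|br|b|i|img|div)\b)\w+[^>]*>/g, '') // keep whitelist + div
        .replace(/<div[^>]*><br[^>]*>/g, '<br><br>')           // div starting with <br ...>
        .replace(/((<br>)?<\/div>)+/g, '<br>')                 // closing divs -> one break
        .replace(/<div[^>]*>/g, '')                            // remaining opening divs
        .replace(/(<br>){2,}/g, '<br><br>')                    // at most one blank line
        .replace(/ style="font-size.*?"/g, '');                // font-size on kept tags
}

const dirty =
    '<div><span style="font-size:12.8px">See ' +
    '<a href="http://example.com" style="font-size:12.8px" target="_blank">test</a>' +
    '</span></div><div><br clear="all"></div><div>Bye</div>';

console.log(cleanGmailHtml(dirty));
// See <a href="http://example.com" target="_blank">test</a><br><br>Bye<br>
```

In Apps Script the input string would typically come from something like GmailMessage.getBody(), which returns the HTML body of a message; the function itself does not depend on any Apps Script service.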