Regex not removing inline CSS from HTML string

Background

I currently have a console application that fetches emails from an Office 365 Outlook account, using the Outlook API 2.0.

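Problem

I am accessing the email body through the API, but the body comes back as an HTML string. I am using my go-to regex to strip the HTML tags, but Outlook adds CSS classes to its HTML, which essentially makes my regex obsolete.

Code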

// Verbatim string literal: embedded double quotes are escaped by doubling them.
string body = @"<html>
<head>
<meta http-equiv=""Content-Type"" content=""text/html; charset=utf-8"">
<meta content=""text/html; charset=us-ascii"">
<meta name=""Generator"" content=""Microsoft Word 15 (filtered medium)"">
<style>
<!--
@font-face
    {font-family:""Cambria Math""}
@font-face
    {font-family:Calibri}
p.MsoNormal, li.MsoNormal, div.MsoNormal
    {margin:0in;
    margin-bottom:.0001pt;
    font-size:11.0pt;
    font-family:""Calibri"",sans-serif}
a:link, span.MsoHyperlink
    {color:#0563C1;
    text-decoration:underline}
a:visited, span.MsoHyperlinkFollowed
    {color:#954F72;
    text-decoration:underline}
span.EmailStyle17
    {font-family:""Calibri"",sans-serif;
    color:windowtext}
.MsoChpDefault
    {font-family:""Calibri"",sans-serif}
@page WordSection1
    {margin:1.0in 1.0in 1.0in 1.0in}
div.WordSection1
    {}
-->
</style>
</head>
<body lang=""EN-US"" link=""#0563C1"" vlink=""#954F72"">
<div class=""WordSection1"">
<p class=""MsoNormal"">&nbsp;</p>
</div>
<hr>
<p><b>Confidentiality Notice:</b> This e-mail is intended only for the addressee named above. It contains information that is privileged, confidential or otherwise protected from use and disclosure. If you are not the intended recipient, you are hereby notified
 that any review, disclosure, copying, or dissemination of this transmission, or taking of any action in reliance on its contents, or other use is strictly prohibited. If you have received this transmission in error, please reply to the sender listed above
 immediately and permanently delete this message from your inbox. Thank you for your cooperation.</p>
</body>
</html>
";
// Non-greedy tag match. Without RegexOptions.Singleline, '.' does not match newlines,
// so the multi-line <!-- ... --> block inside <style> is left untouched.
string viewString1 = Regex.Replace(body, "<.*?>", string.Empty);
// Remove the literal &nbsp; entities that remain after the tags are stripped.
string viewString12 = viewString1.Replace("&nbsp;", string.Empty);

Result of my regex

<!--
@font-face
    {font-family:"Cambria Math"}
@font-face
    {font-family:Calibri}
p.MsoNormal, li.MsoNormal, div.MsoNormal
    {margin:0in;
    margin-bottom:.0001pt;
    font-size:11.0pt;
    font-family:"Calibri",sans-serif}
a:link, span.MsoHyperlink
    {color:#0563C1;
    text-decoration:underline}
a:visited, span.MsoHyperlinkFollowed
    {color:#954F72;
    text-decoration:underline}
span.EmailStyle17
    {font-family:"Calibri",sans-serif;
    color:windowtext}
.MsoChpDefault
    {font-family:"Calibri",sans-serif}
@page WordSection1
    {margin:1.0in 1.0in 1.0in 1.0in}
div.WordSection1
    {}
-->







Confidentiality Notice: This e-mail is intended only for the addressee named above. It contains information that is privileged, confidential or otherwise protected from use and disclosure. If you are not the intended recipient, you are hereby notified
 that any review, disclosure, copying, or dissemination of this transmission, or taking of any action in reliance on its contents, or other use is strictly prohibited. If you have received this transmission in error, please reply to the sender listed above
 immediately and permanently delete this message from your inbox. Thank you for your cooperation.

Objective

I need to be able to strip the HTML tags from the string and also remove the CSS classes that Outlook puts into the body.

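You can replace <!--.*?--> with String.Empty using the Singleline regex option, which makes . match newlines. With that option the original <.*?> pattern also consumes the multi-line comment, because no > appears before its closing -->: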

string viewString1 = Regex.Replace(body, "<.*?>", string.Empty, RegexOptions.Singleline);
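
If the CSS is not always wrapped in a comment, a slightly more defensive variant may help. The following is only a sketch under assumptions: the HtmlText class and its Strip method are hypothetical names that are not part of the original post, and it assumes that collapsing all whitespace (including line breaks) into single spaces is acceptable. It removes whole <style> and <script> elements first, then comments, then the remaining tags, and finally decodes entities such as &nbsp; with WebUtility.HtmlDecode.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not from the original post.
public static class HtmlText
{
    public static string Strip(string html)
    {
        // 1. Drop <style> and <script> elements together with their contents.
        //    Singleline lets '.' cross line breaks; \1 matches the same tag name.
        string text = Regex.Replace(
            html,
            @"<(style|script)\b[^>]*>.*?</\1\s*>",
            string.Empty,
            RegexOptions.Singleline | RegexOptions.IgnoreCase);

        // 2. Drop any comments that sit outside those elements.
        text = Regex.Replace(text, @"<!--.*?-->", string.Empty, RegexOptions.Singleline);

        // 3. Drop the remaining tags. [^>] also matches newlines,
        //    so tags that wrap across lines are covered.
        text = Regex.Replace(text, @"<[^>]+>", string.Empty);

        // 4. Decode entities (&nbsp; becomes U+00A0, &amp; becomes &, and so on).
        text = WebUtility.HtmlDecode(text);

        // 5. Collapse runs of whitespace, including non-breaking spaces, and trim.
        text = Regex.Replace(text, @"[\s\u00A0]+", " ");
        return text.Trim();
    }
}
```

Calling HtmlText.Strip(body) on the email above should leave only the confidentiality notice text. Regex-based stripping remains fragile on arbitrary HTML, so a real HTML parser is generally the more robust choice if the input is not limited to Outlook-generated mail.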