HtmlAgilityPack treats everything after < (less-than sign) as attributes
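I have some input obtained from a textarea. I convert that input into an HTML document, which is later turned into a PDF document.
When a user enters a less-than sign (<), everything in my HtmlDocument breaks. HtmlAgilityPack suddenly treats everything after the less-than sign as attributes. See the output: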
Within this Character Data block I can use double dashes as much as I want (along with <, &,="" ',="" and="" ')="" *and="" *="" %="" myparamentity;="" will="" be="" expanded="" to="" the="" text="" 'has="" been="" expanded'...however,="" i="" can't="" use="" the="" cend="" sequence(if="" i="" need="" to="" use="" it="" i="" must="" escape="" one="" of="" the="" brackets="" or="" the="" greater-than="" sign).="">
It gets a little better if I add:
htmlDocument.OptionOutputOptimizeAttributeValues = true;
which gives me:
Within this Character Data block I can use double dashes as much as I want (along with <, &,= ',= and= ')= *and= *= %= myparamentity;= will= be= expanded= to= the= text= 'has= been= expanded'...however,= i= can't= use= the= cend= sequence(if= i= need= to= use= it= i= must= escape= one= of= the= brackets= or= the= greater-than= sign).=>
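I have tried all the options on HtmlDocument, and none of them lets me specify that the parser should not be strict. On the other hand, I could probably live with it stripping out the <, but adding all the equals signs really doesn't work for me.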
void Main()
{
    var input = @"Within this Character Data block I can use double dashes as much as I want (along with <, &, ', and ') *and * % MyParamEntity; will be expanded to the text 'Has been expanded'...however, I can't use the CEND sequence(if I need to use it I must escape one of the brackets or the greater-than sign).";
    var htmlDoc = WrapContentInHtml(input);
    htmlDoc.DocumentNode.OuterHtml.ToString().Dump();
}

private HtmlDocument WrapContentInHtml(string content)
{
    var htmlBuilder = new StringBuilder();
    htmlBuilder.AppendLine("<!DOCTYPE html>");
    htmlBuilder.AppendLine("<html>");
    htmlBuilder.AppendLine("<head>");
    htmlBuilder.AppendLine("<title></title>");
    htmlBuilder.AppendLine("</head>");
    htmlBuilder.AppendLine("<body><div id='sagsfremstillingContainer'>");
    htmlBuilder.AppendLine(content);
    htmlBuilder.AppendLine("</div></body></html>");
    var htmlDocument = new HtmlDocument();
    htmlDocument.OptionOutputOptimizeAttributeValues = true;
    var htmlDoc = htmlBuilder.ToString();
    htmlDocument.LoadHtml(htmlDoc);
    return htmlDocument;
}
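Does anyone have an idea how I can solve this?
The closest question I could find is:
Losing the 'less than' sign in HtmlAgilityPack loadhtml
The asker there actually complains that the < disappears, which would be fine for me. Of course, fixing the parsing error would be the best solution.
Edit:
I am using HtmlAgilityPack 1.4.9.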
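Your content is plainly wrong. This isn't about "strictness"; it's about the fact that you are pretending a piece of text is valid HTML. In fact, the results you're getting are exactly because the parser is not strict.
When you need to insert plain text into HTML, you need to encode it first, so that all the various HTML control characters are properly translated to HTML. For example, < must be changed to &lt; and & must be changed to &amp;.
One way to handle this is to use the DOM: use InnerText on the target div instead of slapping strings together and pretending they're HTML. Another is to use an explicit encoding method, for example HttpUtility.HtmlEncode.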
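As a rough illustration of the explicit-encoding route, here is a minimal sketch of the question's WrapContentInHtml with the user text encoded before it is appended. It assumes System.Net.WebUtility is available (it avoids a System.Web.dll reference); the class name EncodeBeforeWrapSketch and the string return type are hypothetical choices made so the sample runs without HtmlAgilityPack.

```csharp
using System;
using System.Net;
using System.Text;

static class EncodeBeforeWrapSketch
{
    // Hypothetical variant of the question's WrapContentInHtml that returns the
    // markup as a string; the only substantive change is the HtmlEncode call.
    public static string WrapContentInHtml(string content)
    {
        var htmlBuilder = new StringBuilder();
        htmlBuilder.AppendLine("<!DOCTYPE html>");
        htmlBuilder.AppendLine("<html>");
        htmlBuilder.AppendLine("<head>");
        htmlBuilder.AppendLine("<title></title>");
        htmlBuilder.AppendLine("</head>");
        htmlBuilder.AppendLine("<body><div id='sagsfremstillingContainer'>");
        // Encode the user's plain text so '<' and '&' become character
        // references instead of being read as the start of a tag or entity.
        htmlBuilder.AppendLine(WebUtility.HtmlEncode(content));
        htmlBuilder.AppendLine("</div></body></html>");
        return htmlBuilder.ToString();
    }

    static void Main()
    {
        Console.WriteLine(WebUtility.HtmlEncode("a < b & c")); // prints: a &lt; b &amp; c
        Console.WriteLine(WrapContentInHtml("along with <, &, and more"));
    }
}
```

The returned string can then be passed to htmlDocument.LoadHtml exactly as in the question; because the text no longer contains a raw <, the parser has nothing to mistake for a tag.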
You can use System.Net.WebUtility.HtmlEncode, which works even without a reference to System.Web.dll (which also provides HttpServerUtility.HtmlEncode).
var input = @"Within this Character Data block I can use double dashes as much as I want (along with <, &, ', and ') *and * % MyParamEntity; will be expanded to the text 'Has been expanded'...however, I can't use the CEND sequence(if I need to use it I must escape one of the brackets or the greater-than sign).";
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(System.Net.WebUtility.HtmlEncode(input));
Debug.Assert(!htmlDocument.ParseErrors.Any());
Result:
Within this Character Data block I can use double dashes as much as I want (along with <, &, ', and ') *and * % MyParamEntity; will be expanded to the text 'Has been expanded'...however, I can't use the CEND sequence(if I need to use it I must escape one of the brackets or the greater-than sign).