HTML Agility Pack Problems with W3C tools
I'm trying to access the HTML results of the W3C mobileOK Checker by passing it a URL such as:
http://validator.w3.org/mobile/check?async=false&docAddr=http%3A%2F%2Fwww.google.com/%2Ftv%2F
The URL works if you put it in a browser, but I can't seem to access it through HtmlAgilityPack. The reason is probably that the URL triggers a number of requests to the checker's servers, since it runs an online test, so it isn't just a "static" URL. I've accessed other URLs without any problem. Here is my code:
HtmlAgilityPack.HtmlDocument webGet = new HtmlAgilityPack.HtmlDocument();
HtmlWeb hw = new HtmlWeb();
webGet = hw.Load("http://validator.w3.org/mobile/check?async=false&docAddr=http%3A%2F%2Fwww.google.com/%2Ftv%2F");
HtmlNodeCollection nodes = webGet.DocumentNode.SelectNodes("//head");
if (nodes != null)
{
    foreach (HtmlNode n in nodes)
    {
        string x = n.InnerHtml;
    }
}
Edit: I tried to access it through a StreamReader and the site returns the following error: "The remote server returned an error: (403) Forbidden." I guess that's related.
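For reference, this is roughly the kind of StreamReader call I mean (a minimal sketch, not my exact code; the 403 surfaces as a WebException):

```csharp
using System;
using System.IO;
using System.Net;

class FetchCheck
{
    static void Main()
    {
        var request = (HttpWebRequest)WebRequest.Create(
            "http://validator.w3.org/mobile/check?async=false&docAddr=http%3A%2F%2Fwww.google.com/%2Ftv%2F");
        try
        {
            using (var response = request.GetResponse())
            using (var reader = new StreamReader(response.GetResponseStream()))
            {
                Console.WriteLine(reader.ReadToEnd());
            }
        }
        catch (WebException ex)
        {
            // This is where the site answers:
            // "The remote server returned an error: (403) Forbidden."
            Console.WriteLine(ex.Message);
        }
    }
}
```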
I checked your example and was able to reproduce the described behavior. It looks to me as if w3.org checks whether the requesting program is a browser or something else.

I created an extended WebClient class for another project myself, and with it I was able to access the given URL successfully.
Program.cs

WebClientExtended client = new WebClientExtended();
string exportPath = @"e:\temp"; // adapt to your own needs
string url = "http://validator.w3.org/mobile/check?async=false&docAddr=http%3A%2F%2Fwww.google.com/%2Ftv%2F";

// Load the HTML with the custom WebClient class,
// but use HtmlAgilityPack for parsing, manipulation and so on.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(System.Text.Encoding.UTF8.GetString(client.DownloadData(url)));
doc.Save(Path.Combine(exportPath, "check.html"));
WebClientExtended
public class WebClientExtended : WebClient
{
    #region Fields
    private CookieContainer container = new CookieContainer();
    #endregion

    #region Properties
    public CookieContainer CookieContainer
    {
        get { return container; }
        set { container = value; }
    }
    #endregion

    #region Constructors
    public WebClientExtended()
    {
        this.container = new CookieContainer();
    }
    #endregion

    #region Methods
    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest r = base.GetWebRequest(address);
        var request = r as HttpWebRequest;
        if (request != null)
        {
            // Redirects are followed manually in GetWebResponse so the
            // headers and cookies below survive them.
            request.AllowAutoRedirect = false;
            request.ServicePoint.Expect100Continue = false;
            request.CookieContainer = container;
            // Browser-like headers; without them the validator rejects the request.
            request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8";
            request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"; // IE
            request.Headers.Set("Accept-Encoding", "gzip, deflate, sdch");
            request.Headers.Set("Accept-Language", "de-AT,de;q=0.8,en;q=0.6,en-US;q=0.4,fr;q=0.2");
            // KeepAlive is a restricted header; set it via the property,
            // not via Headers.Add, which would throw an ArgumentException.
            request.KeepAlive = true;
            request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
        }
        return r;
    }

    protected override WebResponse GetWebResponse(WebRequest request)
    {
        WebResponse response = base.GetWebResponse(request);
        // Follow a redirect manually, re-applying the headers and cookies above.
        if (!string.IsNullOrEmpty(response.Headers["Location"]))
        {
            request = GetWebRequest(new Uri(response.Headers["Location"]));
            request.ContentLength = 0;
            response = GetWebResponse(request);
        }
        return response;
    }
    #endregion
}
I think the key is the addition/manipulation of the User-Agent, Accept-Encoding and Accept-Language headers. The result of running my code is the downloaded page check.html.
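If you'd rather stay entirely within HtmlAgilityPack instead of a custom WebClient, HtmlWeb also exposes a PreRequest hook that lets you adjust the underlying HttpWebRequest before it is sent. A minimal sketch (untested against the validator, but it sets the same headers that made my WebClient work):

```csharp
using System.Net;
using HtmlAgilityPack;

class PreRequestExample
{
    static void Main()
    {
        HtmlWeb hw = new HtmlWeb();
        // Return true to tell HtmlWeb to proceed with the (now modified) request.
        hw.PreRequest = request =>
        {
            request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko";
            request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
            request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
            return true;
        };
        HtmlDocument doc = hw.Load("http://validator.w3.org/mobile/check?async=false&docAddr=http%3A%2F%2Fwww.google.com/%2Ftv%2F");
        System.Console.WriteLine(doc.DocumentNode.SelectSingleNode("//title").InnerText);
    }
}
```

This avoids the manual redirect handling, since HtmlWeb keeps auto-redirect enabled by default.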