Agnostic screen scraper using HtmlAgilityPack
Say I want a screen scraper that doesn't care whether the URL you pass it points to an HTML page, an XML document, or a text file.
Example:
http://tonto.eia.doe.gov/oog/info/wohdp/dslpriwk.txt
This works if the page is HTML or a text file:
public class ScreenScrapingService : IScreenScrapingService
{
    public XDocument Scrape(string url)
    {
        var scraper = new HtmlWeb();
        var stringWriter = new StringWriter();
        var xml = new XmlTextWriter(stringWriter);
        scraper.LoadHtmlAsXml(url, xml);
        var text = stringWriter.ToString();
        return XDocument.Parse(text);
    }
}
However, if it is an XML file, for example:
http://www.eia.gov/petroleum/gasdiesel/includes/gas_diesel_rss.xml
[Test]
public void Scrape_ShouldScrapeSomething()
{
    //arrange
    var sut = new ScreenScrapingService();
    //act
    var result = sut.Scrape("http://www.eia.gov/petroleum/gasdiesel/includes/gas_diesel_rss.xml");
    //assert
}
Then I get the error:
An exception of type 'System.Xml.XmlException' occurred in System.Xml.dll but was not handled in user code
Is it possible to write this so that it doesn't care what the URL ends up pointing to?
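To get the exact exception, press Ctrl+Alt+E in Visual Studio and enable Common Language Runtime Exceptions. LoadHtmlAsXml seems to expect HTML, so your best option is probably to use WebClient.DownloadString(url) and an HtmlDocument with the OptionOutputAsXml property set to true, as follows, and catch the exception when parsing fails: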
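// Requires: System.IO, System.Net, System.Xml, System.Xml.Linq, HtmlAgilityPack
public XDocument Scrape(string url)
{
    string htmlorxml;
    using (var wc = new WebClient())
    {
        htmlorxml = wc.DownloadString(url);
    }
    var doc = new HtmlDocument() { OptionOutputAsXml = true };
    // Load the downloaded text; without this the document that gets saved is empty
    doc.LoadHtml(htmlorxml);
    var stringWriter = new StringWriter();
    doc.Save(stringWriter);
    try
    {
        return XDocument.Parse(stringWriter.ToString());
    }
    catch (XmlException)
    {
        // it only gets here when the string is xml already
        try
        {
            return XDocument.Parse(htmlorxml);
        }
        catch (XmlException)
        {
            // neither the converted output nor the raw text is well-formed XML
            return null;
        }
    }
}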
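A possible variation, offered only as a sketch and not as part of the answer above: try strict XML parsing first, and hand the text to HtmlAgilityPack only when that fails, so well-formed XML never goes through the HTML parser. The names MarkupSniffer and TryParseXml are hypothetical; the only framework calls used are XDocument.Parse and the XmlException it throws on malformed input.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class MarkupSniffer
{
    // Hypothetical helper: returns true and the parsed document when the
    // payload is already well-formed XML, false otherwise.
    public static bool TryParseXml(string payload, out XDocument document)
    {
        document = null;

        // Empty or missing text cannot be an XML document.
        if (string.IsNullOrWhiteSpace(payload))
        {
            return false;
        }

        try
        {
            document = XDocument.Parse(payload);
            return true;
        }
        catch (XmlException)
        {
            // Not well-formed XML (for example HTML tag soup or plain text).
            return false;
        }
    }
}
```

In Scrape, this check would run on the downloaded string before the HtmlDocument is built; if it returns false, the HtmlAgilityPack conversion proceeds as before.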