C# parsing web site with ajax loaded content
When I download a web site with the following call, I get the whole page, but without the values that are loaded via ajax.
htmlDoc.LoadHtml(new WebClient().DownloadString(url));
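Is it possible to load the web site with all of its values, the way Google Chrome does?
No, that is not possible with your example, because it only downloads the content as a string and none of the page's JavaScript runs. You need to render that string in a "browser engine", or find a component that does that for you.
I suggest you look at abotx. They have just announced this feature, so it may interest you, but it is not free.
You can use the WebBrowser control to get and render the page. Unfortunately, the control uses Internet Explorer, and you have to change a registry value to force it to use the latest version; even then the implementation is very brittle.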
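The answer does not say which registry value it means. A minimal sketch, assuming it is the FEATURE_BROWSER_EMULATION feature control and that IE11 is installed, could look like this. The class name BrowserEmulation and the method name Enable are made up for illustration, and the code only works on Windows.

```csharp
using System.Diagnostics;
using System.IO;
using Microsoft.Win32;

static class BrowserEmulation
{
    // Per-user feature-control key read by the embedded Internet Explorer engine.
    // Assumption: this is the registry value the answer refers to.
    public const string KeyPath =
        @"Software\Microsoft\Internet Explorer\Main\FeatureControl\FEATURE_BROWSER_EMULATION";

    // 11001 = render every page in IE11 edge mode, regardless of the page's doctype.
    public const int Ie11EdgeMode = 11001;

    // Hypothetical helper: call once at startup, before the first WebBrowser control is created.
    public static void Enable()
    {
        // The value name must be the file name of the running executable, e.g. "MyApp.exe".
        string exeName = Path.GetFileName(Process.GetCurrentProcess().MainModule.FileName);
        using (RegistryKey key = Registry.CurrentUser.CreateSubKey(KeyPath))
        {
            key.SetValue(exeName, Ie11EdgeMode, RegistryValueKind.DWord);
        }
    }
}
```

Writing under HKEY_CURRENT_USER avoids the need for administrator rights. The setting applies per executable name, so a renamed or separately hosted build needs its own entry.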
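Another option is to take a standalone browser engine such as WebKit and make it work in .NET. I found a page explaining how to do this, but it's pretty dated: http://webkitdotnet.sourceforge.net/basics.php
I wrote a small demo application to get the content, and this is what I came up with: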
using System;
using System.IO;
using System.Text;
using NReco.PhantomJS;

class Program
{
    static void Main(string[] args)
    {
        GetRenderedWebPage("https://siderite.dev", TimeSpan.FromSeconds(5), output =>
        {
            Console.Write(output);
            File.WriteAllText("output.txt", output);
        });
        Console.ReadKey();
    }

    private static void GetRenderedWebPage(string url, TimeSpan waitAfterPageLoad, Action<string> callBack)
    {
        // Sentinel line the script prints after the page content
        const string cEndLine = "All output received";
        var sb = new StringBuilder();
        var p = new PhantomJS();
        // Collect PhantomJS console output until the sentinel line arrives
        p.OutputReceived += (sender, e) =>
        {
            if (e.Data == cEndLine)
            {
                callBack(sb.ToString());
            }
            else
            {
                sb.AppendLine(e.Data);
            }
        };
        // The script loads the page, waits for ajax calls to finish, then prints the rendered HTML
        p.RunScript(@"
            var page = require('webpage').create();
            page.viewportSize = { width: 1920, height: 1080 };
            page.onLoadFinished = function(status) {
                if (status=='success') {
                    setTimeout(function() {
                        console.log(page.content);
                        console.log('" + cEndLine + @"');
                        phantom.exit();
                    }," + waitAfterPageLoad.TotalMilliseconds + @");
                }
            };
            var url = '" + url + @"';
            page.open(url);", new string[0]);
    }
}
This uses the PhantomJS "headless" browser by way of the NReco.PhantomJS wrapper, which you can add as a NuGet package reference directly from Visual Studio. I am sure it can be done better, but this is what I did today. You might want to take a look at the PhantomJS callbacks so you can properly debug what is going on. For example, my code waits forever if the page fails to load, because the script calls phantom.exit() only when the load status is 'success'. Here is a useful link: https://newspaint.wordpress.com/2013/04/25/getting-to-the-bottom-of-why-a-phantomjs-page-load-fails/
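One way to avoid the endless wait is to give the script a failure branch that still prints the end line and exits. The sketch below is not the author's code: the class RenderScript and its Build method are hypothetical names, and it assumes the URL is passed as the first script argument (read through system.args[1]) instead of being pasted into the script text.

```csharp
using System;
using System.Globalization;

static class RenderScript
{
    // Hypothetical helper that builds the PhantomJS script.
    // endLine is pasted into a single-quoted JavaScript string, so it must not
    // contain quotes or backslashes.
    public static string Build(TimeSpan waitAfterPageLoad, string endLine)
    {
        // Invariant culture keeps a fractional delay from being written as "1,5".
        string delayMs = waitAfterPageLoad.TotalMilliseconds.ToString(CultureInfo.InvariantCulture);
        return @"
var system = require('system');
var page = require('webpage').create();
page.viewportSize = { width: 1920, height: 1080 };
page.onLoadFinished = function(status) {
    if (status !== 'success') {
        // Report the failure on stderr, then finish so the caller is not left waiting.
        system.stderr.writeLine('Page load failed: ' + status);
        console.log('" + endLine + @"');
        phantom.exit(1);
        return;
    }
    setTimeout(function() {
        console.log(page.content);
        console.log('" + endLine + @"');
        phantom.exit();
    }, " + delayMs + @");
};
// Assumption: the URL arrives as the first script argument.
page.open(system.args[1]);";
    }
}
```

With this helper, the call in GetRenderedWebPage would become p.RunScript(RenderScript.Build(waitAfterPageLoad, cEndLine), new[] { url }). On a failed load the callback then receives an empty string instead of never being called. Passing the URL as an argument also means a quote character in the URL cannot break the script.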