Await AJAX with HtmlAgilityPack in Xamarin
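I have a question that seems to have been asked before, but mine is a little different. I am trying to scrape data from this website, but the problem is that it appears to be loaded with AJAX. As a result, my application cannot find the ids and classes I am looking for in the HTML.
You can reproduce this by inspecting an element or by viewing the page source. Viewing the source shows far less than inspecting an element does.
I thought I could find the file containing the AJAX that loads this HTML by pressing F12, going to the Network tab and selecting XHR, but I could not find it.
My question is: how do I retrieve this data or find out what file is used to collect the data?
My code sample (it cannot find Timetable_toolbar_elementSelect_popup0):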
private async Task GetHtmlDocument(string url)
{
    HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(url);
    //request.Credentials = new LoginCredentials().Credentials;
    try
    {
        WebResponse myResponse = await request.GetResponseAsync();
        HtmlDocument htmlDoc = new HtmlDocument();
        htmlDoc.OptionFixNestedTags = true;
        htmlDoc.Load(myResponse.GetResponseStream());
        var test = htmlDoc.GetElementbyId("Timetable_toolbar_elementSelect_popup0");
    }
    catch (Exception e)
    {
    }
}
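I was going to leave this as a comment, but it grew too long and the formatting got too ugly. So here we go.
First of all: the site is updated dynamically by JavaScript that issues ajax commands.
If you can open a session and store the cookie containing the SESSIONID and the now "encrypted" school name, then you can call the ajax command yourself.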
https://roosters.windesheim.nl/ajaxCommand=getWeeklyTimetable&elementType=1&elementId=13090&date=20171126&formatId=7&departmentId=0&filterId=-2
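However, this does require you to know what elementType and elementId mean.
At the moment, elementId 13090 refers to the Klas (class) 1GLD, and formatId 7 refers to the Roosterformaat (timetable format) "Beknopt". You will have to work out what the remaining variables do. More importantly, if you do manage to send a valid ajax command to the server, you will not get HTML back; you will receive the data as JSON.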
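The parameters in the URL above can be assembled in code rather than hard-coded. The following is a minimal sketch, not the site's documented API: the class and method names (TimetableCommand, BuildWeeklyTimetableBody) are hypothetical, and the default values for departmentId and filterId are simply copied from the example URL, since their meaning has not been worked out here.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class TimetableCommand
{
    // Builds the form-encoded body for the getWeeklyTimetable ajax command.
    // The field names are taken from the example URL; the defaults for
    // departmentId and filterId are assumptions copied from that URL.
    public static string BuildWeeklyTimetableBody(
        int elementType, int elementId, DateTime date,
        int formatId, int departmentId = 0, int filterId = -2)
    {
        var inv = CultureInfo.InvariantCulture;
        var fields = new List<KeyValuePair<string, string>>
        {
            new KeyValuePair<string, string>("ajaxCommand", "getWeeklyTimetable"),
            new KeyValuePair<string, string>("elementType", elementType.ToString(inv)),
            new KeyValuePair<string, string>("elementId", elementId.ToString(inv)),
            // The site expects the date as yyyyMMdd, e.g. 20171126.
            new KeyValuePair<string, string>("date", date.ToString("yyyyMMdd", inv)),
            new KeyValuePair<string, string>("formatId", formatId.ToString(inv)),
            new KeyValuePair<string, string>("departmentId", departmentId.ToString(inv)),
            new KeyValuePair<string, string>("filterId", filterId.ToString(inv)),
        };

        // Escape each key and value, then join them as key=value pairs.
        return string.Join("&", fields.Select(
            f => Uri.EscapeDataString(f.Key) + "=" + Uri.EscapeDataString(f.Value)));
    }
}
```

Calling TimetableCommand.BuildWeeklyTimetableBody(1, 13090, new DateTime(2017, 11, 26), 7) should reproduce the query string shown in the URL above.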
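The simplest way to do what you want is to put all the classes in a separate file and use it as a reference point. The same goes for the other options.
Then use a headless browser such as phantomjs.org with Selenium. That way you can find and click the class you want to scrape. Load the HTML into an HtmlAgilityPack.HtmlDocument and do whatever you need to do. Selenium/PhantomJS will keep track of your cookies.
This approach is slower, but easier to do.
Edit: storing the cookies from a web request, the simple way.
I am not keen on this topic, but the OP asked. If anyone has a better way, please edit.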
CookieContainer cookies = new CookieContainer();
// doc was used below without being declared; declare it here.
HtmlDocument doc = new HtmlDocument();
try
{
    string webAddr = "https://roosters.windesheim.nl/WebUntis/";
    var httpWebRequest = (HttpWebRequest)WebRequest.Create(webAddr);
    httpWebRequest.ContentType = "application/json; charset=utf-8";
    httpWebRequest.Method = "POST";
    // Cookies set by the server are stored in this container.
    httpWebRequest.CookieContainer = cookies;
    httpWebRequest.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
    httpWebRequest.Headers.Add("X-Requested-With", "XMLHttpRequest");
    using (var streamWriter = new StreamWriter(httpWebRequest.GetRequestStream()))
    {
        string json = "ajaxCommand=getWeeklyTimetable&elementType=1&elementId=13092&date=20171126&formatId=7&departmentId=0&filterId=-2";
        streamWriter.Write(json);
        streamWriter.Flush();
    }
    var httpResponse = (HttpWebResponse)httpWebRequest.GetResponse();
    using (var streamReader = new StreamReader(httpResponse.GetResponseStream()))
    {
        cookies.Add(httpWebRequest.CookieContainer.GetCookies(httpWebRequest.RequestUri));
        //cookies.Add(httpResponse.Cookies);
        var responseText = streamReader.ReadToEnd();
        doc.LoadHtml(responseText);
        // Print every cookie the server returned.
        foreach (Cookie c in httpResponse.Cookies)
        {
            Console.WriteLine(c.ToString());
        }
    }
}
catch (WebException ex)
{
    Console.WriteLine(ex.Message);
}
Console.WriteLine(doc.DocumentNode.InnerHtml);
Console.ReadKey();
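Solution: calling the ajax method with web requests.
So I got bored and figured out most of it. What is still missing below is how to identify a Klase by its id. The example below fetches klase '1GLD'. We need the cookie so that the request knows which school we are fetching the Klase from. Also, the code below returns only JSON, not HTML, because it is an ajax method we are calling.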
CookieContainer cookies = new CookieContainer();
// First request: open a session so the server hands us its cookies.
try
{
    string webAddr = "https://roosters.windesheim.nl/";
    var httpWebRequest = (HttpWebRequest)WebRequest.Create(webAddr);
    httpWebRequest.ContentType = "application/json; charset=utf-8";
    httpWebRequest.Method = "POST";
    httpWebRequest.CookieContainer = cookies;
    httpWebRequest.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
    httpWebRequest.Headers.Add("X-Requested-With", "XMLHttpRequest");
    var httpResponse = (HttpWebResponse)httpWebRequest.GetResponse();
    using (var streamReader = new StreamReader(httpResponse.GetResponseStream()))
    {
        cookies.Add(httpWebRequest.CookieContainer.GetCookies(httpWebRequest.RequestUri));
    }
}
catch (WebException ex)
{
    Console.WriteLine(ex.Message);
}
//According to my web debugger the cookie will last until the 10th of December, so a new cookie should not be needed until then.
//I noticed the url used unix timestamps at the end of the url. So we just add the unix timestamp at the end for each request.
long unixTimeStamp = new DateTimeOffset(DateTime.Now).ToUnixTimeMilliseconds() - 100;
//We are now ready to call the ajax method and get the JSON.
try
{
    string webAddr = "https://roosters.windesheim.nl/WebUntis/Timetable.do?request.preventCache=" + unixTimeStamp.ToString();
    var httpWebRequest = (HttpWebRequest)WebRequest.Create(webAddr);
    httpWebRequest.ContentType = "application/x-www-form-urlencoded; charset=utf-8";
    httpWebRequest.Method = "POST";
    httpWebRequest.CookieContainer = cookies;
    httpWebRequest.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
    httpWebRequest.Headers.Add("X-Requested-With", "XMLHttpRequest");
    using (var streamWriter = new StreamWriter(httpWebRequest.GetRequestStream()))
    {
        string json = "ajaxCommand=getWeeklyTimetable&elementType=1&elementId=13090&date=20171126&formatId=7&departmentId=0&filterId=-2";
        //The command below will return a JSON data structure containing all the klases and their relevant ID.
        //string otherJson = "ajaxCommand=getPageConfig&type=1&filter=-2"
        streamWriter.Write(json);
        streamWriter.Flush();
    }
    var httpResponse = (HttpWebResponse)httpWebRequest.GetResponse();
    using (var streamReader = new StreamReader(httpResponse.GetResponseStream()))
    {
        var responseText = streamReader.ReadToEnd();
        //THE RESULT GETS PRINTED HERE.
        Console.Write(responseText);
    }
}
catch (WebException ex)
{
    Console.WriteLine(ex.Message);
}
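Since the question uses async code in Xamarin, the same two-step flow (fetch the session cookie, then post the ajax command) could also be written with HttpClient. This is an untested sketch under assumptions: the names TimetableClient, BuildTimetableUri and GetWeeklyTimetableJsonAsync are hypothetical, and it assumes the server treats an empty POST to the site root the same way it does in the HttpWebRequest version above.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

public static class TimetableClient
{
    // Appends the cache-busting unix timestamp (milliseconds) to the endpoint URL.
    public static Uri BuildTimetableUri(DateTimeOffset now)
    {
        return new Uri("https://roosters.windesheim.nl/WebUntis/Timetable.do?request.preventCache="
                       + now.ToUnixTimeMilliseconds());
    }

    public static async Task<string> GetWeeklyTimetableJsonAsync(string elementId, string date)
    {
        var handler = new HttpClientHandler
        {
            // The handler stores and resends cookies across both requests.
            CookieContainer = new CookieContainer(),
            AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate
        };
        using (var client = new HttpClient(handler))
        {
            client.DefaultRequestHeaders.Add("X-Requested-With", "XMLHttpRequest");

            // Step 1: open a session so the cookie container receives the session cookie.
            // Assumption: an empty POST body is enough, as in the HttpWebRequest version.
            var empty = new StringContent(string.Empty, Encoding.UTF8, "application/json");
            using (await client.PostAsync("https://roosters.windesheim.nl/", empty)) { }

            // Step 2: post the ajax command as a form-encoded body.
            var form = new FormUrlEncodedContent(new[]
            {
                new KeyValuePair<string, string>("ajaxCommand", "getWeeklyTimetable"),
                new KeyValuePair<string, string>("elementType", "1"),
                new KeyValuePair<string, string>("elementId", elementId),
                new KeyValuePair<string, string>("date", date),
                new KeyValuePair<string, string>("formatId", "7"),
                new KeyValuePair<string, string>("departmentId", "0"),
                new KeyValuePair<string, string>("filterId", "-2"),
            });
            using (var response = await client.PostAsync(BuildTimetableUri(DateTimeOffset.UtcNow), form))
            {
                response.EnsureSuccessStatusCode();
                // The body is JSON, not HTML, so it is not a fit for HtmlAgilityPack.
                return await response.Content.ReadAsStringAsync();
            }
        }
    }
}
```

A call such as await TimetableClient.GetWeeklyTimetableJsonAsync("13090", "20171126") would request the same timetable as the example above.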
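Alternative solution with Selenium and the Firefox driver.
This is easier to do, but it also takes some time. It gives you HTML to work with instead, as you asked. Not all of the thread sleeps are necessary, but I found that the ones in the last foreach loop are.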
public static void Main(string[] args)
{
    HtmlDocument doc = new HtmlDocument();
    //According to my web debugger the cookie will last until the 10th of December, so a new cookie should not be needed until then.
    //I noticed the url used unix timestamps at the end of the url. So we just add the unix timestamp at the end for each request.
    long unixTimeStamp = new DateTimeOffset(DateTime.Now).ToUnixTimeMilliseconds() - 100;
    string webAddr = "https://roosters.windesheim.nl/WebUntis/Timetable.do?request.preventCache=" + unixTimeStamp.ToString();
    var ffOptions = new FirefoxOptions();
    ffOptions.BrowserExecutableLocation = @"C:\Program Files (x86)\Mozilla Firefox\firefox.exe";
    ffOptions.LogLevel = FirefoxDriverLogLevel.Default;
    ffOptions.Profile = new FirefoxProfile { AcceptUntrustedCertificates = true };
    var service = FirefoxDriverService.CreateDefaultService();
    var driver = new FirefoxDriver(service, ffOptions, TimeSpan.FromSeconds(120));
    driver.Navigate().GoToUrl(webAddr);
    // Select the school, then open Lesrooster > Klassen.
    driver.FindElement(By.XPath("//input[@id='school']")).SendKeys("Windesheim" + Keys.Enter);
    Thread.Sleep(2000);
    driver.FindElement(By.XPath("//span[@id='dijit_PopupMenuBarItem_0_text' and text() ='Lesrooster']")).Click();
    driver.FindElement(By.XPath("//td[@id='dijit_MenuItem_0_text' and text() ='Klassen']")).Click();
    Thread.Sleep(2000);
    driver.FindElement(By.XPath("//div[@id='widget_Timetable_toolbar_elementSelect']//input[@class='dijitReset dijitInputField dijitArrowButtonInner']")).Click();
    //we get all the options for Klase
    doc.LoadHtml(driver.PageSource);
    HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//div[@id='Timetable_toolbar_elementSelect_popup']/div[@item]");
    List<String> options = new List<String>();
    // SelectNodes returns null when nothing matches, so guard before iterating.
    if (nodes != null)
    {
        foreach (HtmlNode n in nodes)
        {
            options.Add(n.InnerText);
        }
    }
    foreach (string s in options)
    {
        driver.FindElement(By.XPath("//input[@id='Timetable_toolbar_elementSelect']")).Clear();
        driver.FindElement(By.XPath("//input[@id='Timetable_toolbar_elementSelect']")).SendKeys(s);
        Thread.Sleep(2000);
        driver.FindElement(By.XPath("//body")).SendKeys(Keys.Enter);
        Thread.Sleep(2000);
        // The rendered HTML for the current Klase is now available to HtmlAgilityPack.
        doc.LoadHtml(driver.PageSource);
        //Console.WriteLine(driver.Url); //Now we can see the id of the current Klase
    }
    Console.WriteLine(doc.DocumentNode.InnerHtml);
    Console.ReadKey();
}
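The fixed Thread.Sleep(2000) calls could be replaced by polling until the element actually appears, which is usually faster and less fragile. Below is a small generic helper as a sketch; the names Poll and Until are hypothetical and not part of Selenium. Selenium's own WebDriverWait class (in OpenQA.Selenium.Support.UI) serves a similar purpose, if that package is available in your project.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class Poll
{
    // Calls probe repeatedly until it returns a non-null value or the timeout elapses.
    // Exceptions thrown by probe (for example "element not found") are swallowed
    // while there is still time left, and rethrown once the timeout has passed.
    public static T Until<T>(Func<T> probe, TimeSpan timeout, TimeSpan interval) where T : class
    {
        var clock = Stopwatch.StartNew();
        while (true)
        {
            T result = null;
            try
            {
                result = probe();
            }
            catch (Exception) when (clock.Elapsed < timeout)
            {
                // Not ready yet; fall through and retry after the interval.
            }

            if (result != null)
            {
                return result;
            }
            if (clock.Elapsed >= timeout)
            {
                throw new TimeoutException("Condition was not met within " + timeout);
            }
            Thread.Sleep(interval);
        }
    }
}
```

In the Selenium code above, a sleep followed by a lookup might then become Poll.Until(() => driver.FindElement(By.XPath("//input[@id='Timetable_toolbar_elementSelect']")), TimeSpan.FromSeconds(10), TimeSpan.FromMilliseconds(250)), which returns as soon as the input exists.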
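Final update
Using the Selenium solution I was able to get the IDs of all the classes. I have included the file here, so you can use it with the ajax and web request approach.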