在 Xamarin 中使用 HtmlAgilityPack 等待 AJAX

Await AJAX with HtmlAgilityPack in Xamarin

我有一个问题以前好像有人问过,但有点不同。我正在尝试从 this website 抓取数据,但问题是它似乎加载了 AJAX。因此,我的应用程序无法在我正在寻找的 HTML 中找到 id 和 类。

您可以通过检查元素或查看源代码来重现此内容。在查看源代码时,我看到的比检查元素时要少得多。

我想我可以找到包含 AJAX 的文件来加载此 html,方法是按 F12,转到网络选项卡并选择 XHR,但我无法找到它。

My question is: how do I retrieve this data or find out what file is used to collect the data?

我的代码示例(我找不到 Timetable_toolbar_elementSelect_popup0):

private async Task GetHtmlDocument(string url)
        {
            HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(url);
            //request.Credentials = new LoginCredentials().Credentials;

            try
            {
                WebResponse myResponse = await request.GetResponseAsync();
                HtmlDocument htmlDoc = new HtmlDocument();
                htmlDoc.OptionFixNestedTags = true;
                htmlDoc.Load(myResponse.GetResponseStream());
                var test = htmlDoc.GetElementbyId("Timetable_toolbar_elementSelect_popup0");
            }
            catch (Exception e)
            {
            }
        }

我打算将此作为评论留下。但是它变得太大而且格式太糟糕。所以我们开始吧。

首先。使用 ajax 命令调用的 javascript 动态更新站点。

如果您可以打开一个会话并存储包含 SESSIONID 和现在“加密”学校名称的 cookie,那么您可以调用 ajax 命令。

    https://roosters.windesheim.nl/ajaxCommand=getWeeklyTimetable&elementType=1&elementId=13090&date=20171126&formatId=7&departmentId=0&filterId=-2

然而,这确实需要您知道什么是 elementType 和什么是 elementId。

此时elementId等于1GLD时指的是Klas。当 formatID(7) 等于“Beknopt”时,它指的是 Roosterformaat。您必须弄清楚其余变量的作用。更重要的是,如果您成功地能够向服务器发出有效的 ajax 命令,那么您将不会得到 html 作为响应,您将收到 JSON.[= 中的数据。 14=]

做你想做的最简单的方法是将所有 类 放在一个单独的 file 中。并将其用作参考点。其他选项也一样。

然后使用 phantomjs.org with Selenium 这样的无头浏览器。这样你就可以找到并点击你想要抓取的类。将 html 加载到 HtmlAgilityPack.HtmlDocument 中,然后执行您需要执行的操作。 Selenium/PhantomJS 直到跟踪您的 cookie。 这种方法速度较慢 - 但更容易做到。

编辑存储来自网络请求的 cookie - 简单的方法。

我不热衷于这个话题。但是OP问。如果有人有更好的方法,请编辑。

CookieContainer cookies = new CookieContainer();
try
{
    string webAddr = "https://roosters.windesheim.nl/WebUntis/";

    var httpWebRequest = (HttpWebRequest)WebRequest.Create(webAddr);
    httpWebRequest.ContentType = "application/json; charset=utf-8";
    httpWebRequest.Method = "POST";
    httpWebRequest.CookieContainer = cookies;
          
    httpWebRequest.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
    httpWebRequest.Headers.Add("X-Requested-With", "XMLHttpRequest");
    using (var streamWriter = new StreamWriter(httpWebRequest.GetRequestStream()))
    {
        string json = "ajaxCommand=getWeeklyTimetable&elementType=1&elementId=13092&date=20171126&formatId=7&departmentId=0&filterId=-2";

        streamWriter.Write(json);
        streamWriter.Flush();
    }
            

    var httpResponse = (HttpWebResponse)httpWebRequest.GetResponse();
    using (var streamReader = new StreamReader(httpResponse.GetResponseStream()))
    {
        cookies.Add(httpWebRequest.CookieContainer.GetCookies(httpWebRequest.RequestUri));
        //cookies.Add(httpResponse.Cookies);
        var responseText = streamReader.ReadToEnd();
        doc.LoadHtml(responseText);
        foreach(Cookie c in httpResponse.Cookies)
        {
            Console.WriteLine(c.ToString());
        } 
    }
}
catch (WebException ex)
{
    Console.WriteLine(ex.Message);
}
    Console.WriteLine(doc.DocumentNode.InnerHtml);

    Console.ReadKey();

使用网络请求调用 ajax 方法的解决方案。

所以我很无聊,想出了大部分。下面缺少的是如何通过 id 识别 Klase。下面的示例将获取 klase '1GLD'。我们需要 cookie 的原因是为了让请求知道我们从哪所学校获取 Klase。此外,下面的代码仅 returns JSON - 而不是 HTML 因为它是我们调用的 ajax 方法。

CookieContainer cookies = new CookieContainer();
try
{
    string webAddr = "https://roosters.windesheim.nl/";
    var httpWebRequest = (HttpWebRequest)WebRequest.Create(webAddr);
    httpWebRequest.ContentType = "application/json; charset=utf-8";
    httpWebRequest.Method = "POST";
    httpWebRequest.CookieContainer = cookies;        
    httpWebRequest.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
    httpWebRequest.Headers.Add("X-Requested-With", "XMLHttpRequest");

    var httpResponse = (HttpWebResponse)httpWebRequest.GetResponse();
    using (var streamReader = new StreamReader(httpResponse.GetResponseStream()))
    {
        cookies.Add(httpWebRequest.CookieContainer.GetCookies(httpWebRequest.RequestUri));
    }
}
catch (WebException ex)
{
    Console.WriteLine(ex.Message);
}

//According to my web debugger the cookie will last until the 10th of December. So need to fix a new cookie until then.
//I noticed the url used unixtimestamps at the end of the url. So we just add the unixtimestamp at the end for each request.
long unixTimeStamp = new DateTimeOffset(DateTime.Now).ToUnixTimeMilliseconds() - 100;

//we are now ready to call the ajax method and get the JSON.
try
{
    string webAddr = "https://roosters.windesheim.nl/WebUntis/Timetable.do?request.preventCache="+unixTimeStamp.ToString();
    var httpWebRequest = (HttpWebRequest)WebRequest.Create(webAddr);
    httpWebRequest.ContentType = "application/x-www-form-urlencoded; charset=utf-8";
    httpWebRequest.Method = "POST";
    httpWebRequest.CookieContainer = cookies;
    httpWebRequest.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
    httpWebRequest.Headers.Add("X-Requested-With", "XMLHttpRequest");

    using (var streamWriter = new StreamWriter(httpWebRequest.GetRequestStream()))
    {
        string json = "ajaxCommand=getWeeklyTimetable&elementType=1&elementId=13090&date=20171126&formatId=7&departmentId=0&filterId=-2";

        //The command below will return a JSON datastructure containing all the klases and their relevant ID.
        //string otherJson = "ajaxCommand=getPageConfig&type=1&filter=-2"


        streamWriter.Write(json);
        streamWriter.Flush();
    }


    var httpResponse = (HttpWebResponse)httpWebRequest.GetResponse();
    using (var streamReader = new StreamReader(httpResponse.GetResponseStream()))
    {
        var responseText = streamReader.ReadToEnd();
        //THE RESULTS GETS PRINTED HERE.
        Console.Write(responseText);
    }
}
catch (WebException ex)
{
    Console.WriteLine(ex.Message);
}

Selenium 和 Firefox 驱动程序的其他解决方案。

这更容易做到。但这也需要一些时间。并非所有线程休眠都是必需的。这将提供 HTML 以按照您的要求与 istead 一起工作。但是我发现在最后一个foreach循环中有必要。

public static void Main(string[] args)
{
    HtmlDocument doc = new HtmlDocument();
    //According to my web debugger the cookie will last until the 10th of December. So need to fix a new cookie until then.
    //I noticed the url used unixtimestamps at the end of the url. So we just add the unixtimestamp at the end for each request.
    long unixTimeStamp = new DateTimeOffset(DateTime.Now).ToUnixTimeMilliseconds() - 100;
    string webAddr = "https://roosters.windesheim.nl/WebUntis/Timetable.do?request.preventCache="+unixTimeStamp.ToString();
    var ffOptions = new FirefoxOptions();
    ffOptions.BrowserExecutableLocation = @"C:\Program Files (x86)\Mozilla Firefox\firefox.exe";
    ffOptions.LogLevel = FirefoxDriverLogLevel.Default;
    ffOptions.Profile = new FirefoxProfile { AcceptUntrustedCertificates = true };
    var service = FirefoxDriverService.CreateDefaultService();

    var driver = new FirefoxDriver(service, ffOptions, TimeSpan.FromSeconds(120));


    driver.Navigate().GoToUrl(webAddr);


    driver.FindElement(By.XPath("//input[@id='school']")).SendKeys("Windesheim"+Keys.Enter);
    Thread.Sleep(2000);
    driver.FindElement(By.XPath("//span[@id='dijit_PopupMenuBarItem_0_text' and text() ='Lesrooster']")).Click();

    driver.FindElement(By.XPath("//td[@id='dijit_MenuItem_0_text' and text() ='Klassen']")).Click();
    Thread.Sleep(2000);

    driver.FindElement(By.XPath("//div[@id='widget_Timetable_toolbar_elementSelect']//input[@class='dijitReset dijitInputField dijitArrowButtonInner']")).Click();

    //we get all the options for Klase
    doc.LoadHtml(driver.PageSource);
    HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//div[@id='Timetable_toolbar_elementSelect_popup']/div[@item]");
    List<String> options = new List<String>();
    foreach (HtmlNode n in nodes)
    {
        options.Add(n.InnerText);
    }

    foreach(string s in options)
    {
        driver.FindElement(By.XPath("//input[@id='Timetable_toolbar_elementSelect']")).Clear();
        driver.FindElement(By.XPath("//input[@id='Timetable_toolbar_elementSelect']")).SendKeys(s);
        Thread.Sleep(2000);
        driver.FindElement(By.XPath("//body")).SendKeys(Keys.Enter);
        Thread.Sleep(2000);
        doc.LoadHtml(driver.PageSource);
        //Console.WriteLine(driver.Url); //Now we can see the id of the current Klase
    }

    Console.WriteLine(doc.DocumentNode.InnerHtml);

    Console.ReadKey();
}

最后更新

使用 Selenium 解决方案,我能够获得所有课程的 ID。我包含了文件 here,因此您可以将其用于 ajax 和网络请求。