Selenium C# web scraping - failed to resolve/parse HTML PageSource
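I wrote a simple .NET console application in C# that uses Selenium to scrape a dynamic page for personal use.
The Selenium navigation works fine, but when I parse the resulting page source to retrieve the list of property addresses, it comes back empty. On top of that, it also prints warnings and errors related to the Chrome browser.
Full code: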
public static void SeleniumExtract()
{
    // initial setup
    IWebDriver driver = new ChromeDriver();
    driver.Navigate().GoToUrl("https://www.knightfrank.co.uk/");
    WebDriverWait wait = new WebDriverWait(driver, TimeSpan.FromSeconds(1));
    // dropdown
    var dropdown1 = driver.FindElement(By.Id("cpMain_ucc1_ctl00_liResidentialFront"));
    dropdown1.Click();
    // enter search query
    var search = driver.FindElement(By.Id("cpMain_ucc1_ctl00_txtResidentialSearchBox"));
    search.Click();
    search.SendKeys("London");
    // select search suggestion from dropdown menu
    wait.Until(SeleniumExtras.WaitHelpers.ExpectedConditions.ElementExists(By.Id("ui-id-4")));
    var suggestion = driver.FindElement(By.Id("ui-id-4"));
    suggestion.Click();
    // submit search
    var submit = driver.FindElement(By.XPath("//div[@id='cpMain_ucc1_ctl00_pnlContentResidential']//a[@class='search-button']"));
    submit.Click();
    // get the data
    var elements = driver.FindElements(By.XPath("//div[@class='grid-address']"));
    foreach(var item in elements)
    {
        Console.WriteLine(item.Text);
    }
}
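Problem #1 - Selenium:
Selenium does not return any results to the console. The XPath //div[@class='grid-address'] is definitely accurate, so there is no typo. I don't know why the foreach loop prints nothing to the console; that is, this part of the code does not work: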
// get the data
var elements = driver.FindElements(By.XPath("//div[@class='grid-address']"));
foreach(var item in elements)
{
    Console.WriteLine(item.Text);
}
Problem #2 - Html Agility Pack:
Alternatively, I tried using Html Agility Pack to parse the PageSource, but it only produces a null exception.
First, I return the page source from SeleniumExtract():
// export current pagesource
var currentPage = driver.PageSource;
return currentPage;
Then I load the page source into Html Agility Pack for parsing. It returns nothing!
public static void HapParse(string currentPage)
{
    HtmlDocument htmlDoc = new HtmlDocument();
    htmlDoc.Load(currentPage);
    var address = htmlDoc.DocumentNode
        .SelectNodes("//div[@class='grid-address']")
        .ToList();
    foreach(var item in address)
    {
        Console.WriteLine(item.InnerText);
    }
}
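A side note on the snippet above, offered as a hedged sketch rather than a confirmed diagnosis: in Html Agility Pack, HtmlDocument.Load(string) treats its argument as a file path, while HtmlDocument.LoadHtml(string) parses markup that is already in memory. SelectNodes also returns null when nothing matches, so chaining .ToList() onto it throws. The names HapSketch and ExtractAddresses below are hypothetical and exist only for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class HapSketch
{
    // Hypothetical helper: returns the trimmed text of every
    // <div class="grid-address"> found in an HTML string.
    public static List<string> ExtractAddresses(string html)
    {
        var doc = new HtmlDocument();
        // LoadHtml parses a string; Load(string) would look for a file on disk.
        doc.LoadHtml(html);

        var nodes = doc.DocumentNode.SelectNodes("//div[@class='grid-address']");
        // SelectNodes yields null (not an empty collection) when there is no match.
        if (nodes == null)
        {
            return new List<string>();
        }
        return nodes.Select(n => n.InnerText.Trim()).ToList();
    }
}
```

Calling HapSketch.ExtractAddresses(driver.PageSource) would still return an empty list if the page source was captured before the results had rendered.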
Problem #3 - AngleSharp:
I tried doing the same with AngleSharp, but it still did not work.
public static async void AngleSharpParse(string currentPage)
{
    var config = Configuration.Default;
    var context = BrowsingContext.New(config);
    var document = await context.OpenAsync(req => req.Content(currentPage));
    var elements = document.QuerySelectorAll("div.grid-address");
    foreach (var item in elements)
    {
        Console.WriteLine(item.TextContent);
    }
}
Problem #4 - Warnings and errors related to the Chrome browser:
Every time the code runs, the console also shows the following errors:
[39536:28224:1103/183909.271:ERROR:chrome_browser_main_extra_parts_metrics.cc(230)] crbug.com/1216328: Checking Bluetooth availability started. Please report if there is no report that this ends.
[39536:28224:1103/183909.271:ERROR:chrome_browser_main_extra_parts_metrics.cc(233)] crbug.com/1216328: Checking Bluetooth availability ended.
[39536:28224:1103/183909.271:ERROR:chrome_browser_main_extra_parts_metrics.cc(236)] crbug.com/1216328: Checking default browser status started. Please report if there is no report that this ends.
[39536:39768:1103/183909.275:ERROR:device_event_log_impl.cc(214)] [18:39:09.274] USB: usb_service_win.cc:389 Could not read device interface GUIDs: The system cannot find the file specified. (0x2)
[39536:39768:1103/183909.280:ERROR:device_event_log_impl.cc(214)] [18:39:09.280] USB: usb_device_handle_win.cc:1048
Failed to read descriptor from node connection: A device attached to the system is not functioning. (0x1F)
[39536:28224:1103/183909.291:ERROR:chrome_browser_main_extra_parts_metrics.cc(240)] crbug.com/1216328: Checking default browser status ended.
[39536:39768:1103/183909.310:ERROR:device_event_log_impl.cc(214)] [18:39:09.310] USB: usb_device_handle_win.cc:1048
Failed to read descriptor from node connection: A device attached to the system is not functioning. (0x1F)
Can anyone tell me why this very simple code returns no results?
P.S.
I actually wrote exactly the same code in Python with Selenium and Beautiful Soup, and everything worked fine.
What am I missing here?
I just needed to add a wait after Selenium submits the search query, for example Thread.Sleep(3000), so that the page finishes loading before the HTML is parsed.
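As an alternative to a fixed sleep, the same idea can be sketched with an explicit wait that polls until the result elements actually exist. This is a minimal sketch, not the poster's confirmed fix: the class name ResultWait, the method name WaitForTexts, and the 10-second timeout in the usage note are assumptions chosen for illustration. WebDriverWait comes from the Selenium.Support package, which the question's code already uses.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using OpenQA.Selenium;
using OpenQA.Selenium.Support.UI;

public static class ResultWait
{
    // Hypothetical helper: polls until at least one element matches the
    // locator, then returns the visible text of every match.
    // Throws WebDriverTimeoutException if nothing appears within the timeout.
    public static List<string> WaitForTexts(IWebDriver driver, By locator, TimeSpan timeout)
    {
        var wait = new WebDriverWait(driver, timeout);

        // Until keeps calling the lambda until it returns a non-null value.
        var elements = wait.Until(d =>
        {
            var found = d.FindElements(locator);
            return found.Count > 0 ? found : null;
        });

        return elements.Select(e => e.Text).ToList();
    }
}
```

In SeleniumExtract() this could replace the final FindElements call after submit.Click(), for example: foreach (var text in ResultWait.WaitForTexts(driver, By.XPath("//div[@class='grid-address']"), TimeSpan.FromSeconds(10))) Console.WriteLine(text);. Unlike Thread.Sleep(3000), the wait returns as soon as the results render and fails loudly if they never do. If the page re-renders the list while it is being read, a stale element exception is still possible.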