Capturing data from the browser page inspector in a C# .NET Core console application

Question

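My C# .NET Core console application is a simple web crawler. On pages where the data I need is present in the page source, I can access it. On pages where the data can be copied from the window and is visible in the browser's page inspector, but is not in the page source, I am stuck.

Please provide a code example showing how I can retrieve this data.

My current capture code is as follows: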

// Requires: using System.Net;
var htmlCode = string.Empty;
using (WebClient client = new WebClient()) // WebClient class inherits IDisposable
{
    // Get the file content without saving it
    htmlCode = client.DownloadString("https://www.wedj.com/dj-photo-video.nsf/firstdance.html");
}

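With the code above, you receive only the page's source code. The data shown in Figure 1, as seen in the browser inspector, is hidden inside the following element, which the downloaded source does not populate:

<div class="entry row">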

Answer 1

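Read up on the Selenium automation tool for C#. Be aware that it has to open every web page you want to scrape in a real browser; it can then, for example, return the page source or perform actions on that page.

Generally speaking this tool isn't (afaik) meant for web crawling, but it can be a good starting point, especially if your dotnet core application lives on some virtual machine or in Docker.

But beware: opening unsafe pages through a browser can be risky.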

Answer 2

You might want to try Puppeteer Sharp. It lets you get the current HTML state of the page.

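// Requires the PuppeteerSharp NuGet package (using PuppeteerSharp;).
// Assumes a recent version, where DownloadAsync() fetches a compatible browser.
await new BrowserFetcher().DownloadAsync();
using (var browser = await Puppeteer.LaunchAsync(new LaunchOptions { Headless = true }))
using (var page = await browser.NewPageAsync())
{
    await page.GoToAsync("http://www.spapage.com");
    // Returns the page's current HTML (the live DOM, not the raw response).
    var result = await page.GetContentAsync();
}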

https://github.com/kblok/puppeteer-sharp

Answer 3

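There are a few ways to achieve what you need (for a C# console application).

Perhaps the simplest is to use a tool that interacts with a browser instance, e.g. Selenium (normally used for automated tests). So:

Install the Selenium.WebDriver NuGet package
Install a browser for your application to drive (say, Chrome)
Download the matching browser driver (chromedriver)

Then write something like this: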

// Requires: using System; using OpenQA.Selenium; using OpenQA.Selenium.Chrome;
IWebDriver driver = null;
try
{
    ChromeOptions options = new ChromeOptions();
    options.AddArguments("--incognito");

    driver = new ChromeDriver(options);
    // Let FindElement retry for up to 5 seconds while the script fills the page.
    driver.Manage().Timeouts().ImplicitWait = TimeSpan.FromSeconds(5);
    driver.Url = "https://www.wedj.com/dj-photo-video.nsf/firstdance.html";

    var musicTable = driver.FindElement(By.Id("musicTable"));
    // interact with driver to get data from the page.
}
finally
{
    if (driver != null)
        driver.Dispose();
}
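
As a rough illustration of the "interact with driver" step, the sketch below reads the cell text out of a table element such as musicTable. TableReader and ReadRows are hypothetical names, and the assumption that the table consists of ordinary tr rows with td cells has not been checked against the live page.

```csharp
using System.Collections.Generic;
using System.Linq;
using OpenQA.Selenium;

public static class TableReader
{
    // Returns the trimmed text of every td cell, one string array per row.
    // Rows without td cells (for example header rows built from th) are skipped.
    public static List<string[]> ReadRows(IWebElement table)
    {
        return table.FindElements(By.TagName("tr"))
            .Select(tr => tr.FindElements(By.TagName("td"))
                .Select(td => td.Text.Trim())
                .ToArray())
            .Where(cells => cells.Length > 0)
            .ToList();
    }
}
```

Inside the try block this could be called as TableReader.ReadRows(musicTable), after which each row can be printed or stored as needed.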

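Otherwise, you need to look a little deeper into how the page works. As far as I can see, the page loads a JavaScript file, https://www.wedj.com/dj-photo-video.nsf/musiclist.js, which is responsible for loading the music list from the server. This script basically loads its data from the following URL: https://www.wedj.com/gbmusic.nsf/musicList?open&wedj=1&list=category_firstdance&count=100 (you can open it in your browser too). Once you strip the surrounding "(" and ")", the result is JSON that you can parse (for example with the Newtonsoft.Json package):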

{
  "more": "yes",
  "title": "<h1>Most Requested Wedding First Dance Songs<\/h...",
  "event": "<table class='musicTable g6-table-all g6-small' id='musicTable' borde..."
}

The event property contains the data you need (it can be parsed with the HtmlAgilityPack NuGet package).
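
The manual approach could look roughly like the sketch below. It assumes the response is a single JSON object wrapped in parentheses, as described above, and that the table inside the event property uses plain tr rows with td cells; neither has been verified against the live server. MusicListScraper, ExtractJsonObject and ParseRows are hypothetical names introduced for illustration.

```csharp
using System;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;        // NuGet: HtmlAgilityPack
using Newtonsoft.Json.Linq;   // NuGet: Newtonsoft.Json

public static class MusicListScraper
{
    // URL quoted in the answer above.
    private const string ListUrl =
        "https://www.wedj.com/gbmusic.nsf/musicList?open&wedj=1&list=category_firstdance&count=100";

    // Keeps only the text from the first '{' to the last '}', which drops
    // the surrounding "(" and ")" (assumes the payload is one JSON object).
    public static string ExtractJsonObject(string payload)
    {
        int start = payload.IndexOf('{');
        int end = payload.LastIndexOf('}');
        if (start < 0 || end < start)
            throw new FormatException("No JSON object found in the response.");
        return payload.Substring(start, end - start + 1);
    }

    // Parses the HTML stored in the "event" property and returns the
    // td cell texts, one string array per table row.
    public static string[][] ParseRows(string json)
    {
        string tableHtml = (string)JObject.Parse(json)["event"];
        var doc = new HtmlDocument();
        doc.LoadHtml(tableHtml ?? string.Empty);

        // SelectNodes returns null when nothing matches the XPath.
        var rows = doc.DocumentNode.SelectNodes("//tr");
        if (rows == null)
            return Array.Empty<string[]>();

        return rows
            .Select(tr => tr.Elements("td")
                .Select(td => HtmlEntity.DeEntitize(td.InnerText).Trim())
                .ToArray())
            .Where(cells => cells.Length > 0) // skip rows without td cells
            .ToArray();
    }

    public static async Task Main()
    {
        using (var http = new HttpClient())
        {
            string payload = await http.GetStringAsync(ListUrl);
            foreach (var cells in ParseRows(ExtractJsonObject(payload)))
                Console.WriteLine(string.Join(" | ", cells));
        }
    }
}
```

Keeping the string handling in ExtractJsonObject and ParseRows separate from the HTTP call means both can be tried on a saved copy of the response without hitting the server.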

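Pros of Selenium:

Easy to interact with
The behavior is the same as what you see in the browser

Cons of Selenium:

You need Chrome or another browser installed
The browser is up and running the whole time you interact with it
The browser downloads the full page (images, html, js, css...)

Pros of the manual approach:

You load only what you need
No dependency on an external program (i.e. the browser)

Cons of the manual approach:

You need to understand how the html/js works
You need to parse the json/html manually

In this case I would prefer the second option.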

