How can I scrape a website for the nav menu only?

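I'm building a program that scrapes a website. It looks at the whole site, grabs only the site's header and footer navigation menus, and then inserts new HTML tags (div, p, table, etc.) between the header and footer menus.

I'm looking for ideas on how to scrape only the header and footer navigation menus and add code between the two.

I'm using HTML Agility Pack and have looked into a few approaches.

Approach 1: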

In most cases, the header and footer navigation menus consist mostly of links and contain very little plain text. I used a threshold variable that was a ratio of text to links. If the text:links ratio for a node was less than the threshold, the node was considered a menu node and was saved. Any node whose text:links ratio was greater than the threshold was removed.
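
A minimal sketch of this ratio test, written in Python with the standard-library `html.parser` rather than HTML Agility Pack so that it runs on its own. It assumes the ratio is measured as characters of non-link text divided by characters of link text, which is only one possible reading of "text to links". The names `text_to_link_ratio` and `looks_like_menu` and the default threshold of 0.5 are hypothetical, chosen for illustration.

```python
from html.parser import HTMLParser


class LinkTextCounter(HTMLParser):
    """Counts characters of text inside and outside <a> elements."""

    def __init__(self):
        super().__init__()
        self.link_depth = 0    # > 0 while inside an <a> element
        self.link_chars = 0    # characters of text inside links
        self.plain_chars = 0   # characters of text outside links

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth > 0:
            self.link_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        if self.link_depth > 0:
            self.link_chars += n
        else:
            self.plain_chars += n


def text_to_link_ratio(fragment):
    """Ratio of non-link text to link text for an HTML fragment."""
    counter = LinkTextCounter()
    counter.feed(fragment)
    counter.close()
    if counter.link_chars == 0:
        return float("inf")  # no links at all, so never a menu
    return counter.plain_chars / counter.link_chars


def looks_like_menu(fragment, threshold=0.5):
    """Assumed rule: a node is a menu if its ratio is below the threshold."""
    return text_to_link_ratio(fragment) < threshold


menu = '<ul><li><a href="/">Home</a></li><li><a href="/about">About</a></li></ul>'
article = '<div><p>This is a long paragraph of body text.</p><a href="/x">more</a></div>'
print(looks_like_menu(menu), looks_like_menu(article))
```

A single fixed threshold is the weak point of this test: link-heavy content blocks and text-heavy menus fall on the wrong side of it.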

Approach 1 worked for some websites but not for others, so I abandoned it.

Approach 2:

I searched each node for an id or class attribute that included "nav" or "menu". The match was case-insensitive, and "nav" and "menu" could be surrounded by any combination of characters. That way, it would include ids and classes such as "bottomNav", "navRight1", "LeftMenu2", etc. If the id or class contained either "nav" or "menu", the node would be saved. If neither the node's own attributes nor those of any of its descendants contained either term, the node would be deleted.
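
The attribute test can be sketched as follows, again in standard-library Python instead of HTML Agility Pack. The names `has_menu_hint` and `find_menu_candidates` are hypothetical. The sketch only lists the matching elements and does not prune the rest of the tree.

```python
import re
from html.parser import HTMLParser

# Case-insensitive substring match, so "bottomNav" and "LeftMenu2" both qualify.
MENU_HINT = re.compile(r"nav|menu", re.IGNORECASE)


def has_menu_hint(attrs):
    """True if the id or class value contains 'nav' or 'menu' in any case."""
    for name, value in attrs:
        if name in ("id", "class") and value and MENU_HINT.search(value):
            return True
    return False


class MenuCandidateFinder(HTMLParser):
    """Collects (tag, id, class) for every element whose id or class matches."""

    def __init__(self):
        super().__init__()
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        if has_menu_hint(attrs):
            values = dict(attrs)
            self.candidates.append((tag, values.get("id"), values.get("class")))


def find_menu_candidates(html):
    finder = MenuCandidateFinder()
    finder.feed(html)
    finder.close()
    return finder.candidates


sample = '<div id="bottomNav"></div><ul class="LeftMenu2"></ul><p class="intro"></p>'
print(find_menu_candidates(sample))
```

Because this is a plain substring match, an unrelated value such as "navyBlue" would also be accepted.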

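Again, Approach 2 worked for some websites but not for others.

Even for the sites where these approaches did work, I still couldn't place new HTML code between the two menus, because I couldn't tell where the header menu ends and where the footer menu begins.

I'm just looking for other ideas on how to scrape only the header and footer navigation menus from a website and insert new HTML code between the two.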

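Instead of looking for specific elements or element classes (header, nav, ...), you could try looking at the problem differently:

  • First, fetch and parse two (or more) pages from each website, ideally checking that they differ noticeably (but are not completely different);
  • Then, do a diff (preferably on the DOM) and keep only the common structure.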

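This common structure should consist mainly of the header, footer, navigation bars, and other elements that stay more or less the same across the pages of a given website.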
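
A rough sketch of the two steps above, in standard-library Python. It is a simplification and not a real tree diff: each text node is reduced to a pair of (element path, text), and only the pairs present on both pages are kept. The names `PathCollector`, `collect`, and `common_structure` are hypothetical, and the two sample pages are invented for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PathCollector(HTMLParser):
    """Records (element path, text) for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []    # labels of the currently open elements
        self.entries = []  # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        element_id = dict(attrs).get("id")
        self.stack.append(tag + "#" + element_id if element_id else tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, tolerating sloppy markup.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split("#")[0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.entries.append(("/".join(self.stack), text))


def collect(html):
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.entries


def common_structure(page_a, page_b):
    """Entries of page_a that also appear in page_b, in page_a's order."""
    shared = set(collect(page_b))
    return [entry for entry in collect(page_a) if entry in shared]


page_a = ('<html><body><div id="header"><a href="/">Home</a><a href="/about">About</a></div>'
          '<div id="main"><p>Article one</p></div>'
          '<div id="footer"><a href="/contact">Contact</a></div></body></html>')
page_b = ('<html><body><div id="header"><a href="/">Home</a><a href="/about">About</a></div>'
          '<div id="main"><h1>Title</h1><p>Article two</p></div>'
          '<div id="footer"><a href="/contact">Contact</a></div></body></html>')

for path, text in common_structure(page_a, page_b):
    print(path, "->", text)
```

In this toy example only the header and footer links survive. The point where the surviving entries jump from one subtree to another is one candidate for where the page-specific content sat, and therefore for where new HTML could be inserted.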

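A final step could be to look in this common structure for the small gaps caused by headers/footers that vary with context, as opposed to the large gaps caused by differing (main) content, and to collect their possible values from the largest set of pages you can fetch from each website.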