Crawler over unstructured HTML with Node.js

Question

I need to crawl/scrape static, unstructured HTML. I am trying to get the content with Node.js code; I tried cheerio and xpath, without success.

http://static.puertos.es/pred_simplificada/Predolas/Tablas/Cnt/PAS.html

The XPath of the first element to fetch is /html/body/center/center/table/tbody/tr[3]; I then need the text of every TD in that TR.

If I try to get the tbody node:

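      // Module names below are inferred from the variable names in the snippet.
      var parse5 = require('parse5');
      var xmlser = require('xmlserializer');
      var dom = require('xmldom').DOMParser;
      var xpath = require('xpath');

      var parser = new parse5.Parser();
      var document = parser.parse(response.toString());
      var xhtml = xmlser.serializeToString(document);
      var doc = new dom().parseFromString(xhtml);
      var select = xpath.useNamespaces({"x": "http://www.w3.org/1999/xhtml"});
      var nodes = select("//x:tbody", doc);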

I always get [] (no nodes).

I also tried iterating over the TR elements with cheerio but, as mentioned above, without success.

var $ = cheerio.load(response);
$('tr').each(function(i, e) {
    console.log("Content %j", $(e));
});

Answer 1

Use the lowerCaseTags option, because the HTML may contain a mix of tr and TR:

 $ = cheerio.load(html, { lowerCaseTags: true });

You should do the same for attributes:

 $ = cheerio.load(html, { lowerCaseTags: true, lowerCaseAttributeNames : true });

Hope this helps.

Answer 2

Cheerio does not work well on HTML without CSS. So I tried another workaround using YQL, following that tutorial:

select * from html where url='http://static.puertos.es/pred_simplificada/Predolas/Tablas/Cnt/PAS.html' and xpath='//html/body/center/center/table/tbody'

With YQL I got what I needed, so I will integrate it with node-yql.
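For reference, here is a rough sketch of how the same query could be turned into a request URL without any extra library. The endpoint constant and the helper name buildYqlUrl are assumptions that do not come from the post above, and this only builds the URL; it does not send the request or check that the service is reachable.

```javascript
// Assumed public YQL REST endpoint; not given in the original post.
const YQL_ENDPOINT = 'https://query.yahooapis.com/v1/public/yql';

// Hypothetical helper: builds the YQL statement for a page and an XPath,
// then encodes it as the q parameter of the request URL.
// Neither argument may contain a single quote, since the statement
// wraps both in single quotes and no escaping is attempted.
function buildYqlUrl(pageUrl, xpathExpr) {
  const query =
    "select * from html where url='" + pageUrl +
    "' and xpath='" + xpathExpr + "'";
  const params = new URLSearchParams({ q: query, format: 'json' });
  return YQL_ENDPOINT + '?' + params.toString();
}

const yqlUrl = buildYqlUrl(
  'http://static.puertos.es/pred_simplificada/Predolas/Tablas/Cnt/PAS.html',
  '//html/body/center/center/table/tbody'
);
console.log(yqlUrl);
```

The resulting URL can be fetched with any HTTP client; node-yql wraps the same kind of request behind its own interface.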

xpath

web-crawler

node.js

cheerio