Scraping a URL value using contains and Cheeriogs

Question

I am using the Cheeriogs library for scraping.

This is the element whose href value I need to collect:

<a class="tnmscn" itemprop="url" href="/en/predictions-tips-wealdstone-solihull-moors-1455115">

This is the code I currently use to extract the value:

const contentText = UrlFetchApp.fetch(url).getContentText();
const $ = Cheerio.load(contentText);

const scrapurl = $('div.schema > div > div.tnms > div > a.tnmscn');
const urlmatch = $(scrapurl).attr('href').trim();
Logger.log(urlmatch);

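But this is not reliable: I am afraid the layout of the site will eventually change, and the selector will then collect some other link instead of the one in the clickable element at that position.

So, to make it safer, I tried using: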

div.schema > div > div.tnms > div > a:contains("/en/predictions-tips")

That did not work. How should I use contains for this need?

Additional information:

Page link:
https://www.forebet.com/en/teams/wealdstone


Answer 1

In your situation, how about the following selectors?

From:

const scrapurl = $('div.schema > div > div.tnms > div > a.tnmscn');

To:

const scrapurl = $('a.tnmscn[href^="/en/predictions"]');

Or

const scrapurl = $('div.schema > div > div.tnms > div > a.tnmscn[href^="/en/predictions"]');

Or

const scrapurl = $('div.schema > div > div.tnms > div > a[href^="/en/predictions"]');

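With each of the modified selectors above, /en/predictions-tips-wealdstone-solihull-moors-1455115 is retrieved.
In these selectors, [href^="/en/predictions"] matches a tags whose href value starts with /en/predictions. The first two selectors additionally require the class tnmscn on the a tag.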
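
As for why the :contains attempt in the question did not match: :contains("...") tests an element's text content, while [href^="..."] tests the start of the href attribute value, and [href*="..."] tests for a substring anywhere in it. The following is only a plain-JavaScript sketch of those three rules, not how Cheerio implements them. The link object, its text value, and the three helper functions are hypothetical names made up for illustration; only the href comes from the question.

```javascript
// Hypothetical stand-in for a parsed <a> element: its href attribute and its visible text.
// The text value is invented for this sketch; only the href comes from the question.
const link = {
  href: '/en/predictions-tips-wealdstone-solihull-moors-1455115',
  text: 'Wealdstone - Solihull Moors',
};

// Models a:contains("s"): looks at the text content only, never at attributes.
function matchesContains(el, s) {
  return el.text.includes(s);
}

// Models a[href^="s"]: the attribute value must start with s.
function matchesHrefPrefix(el, s) {
  return el.href.startsWith(s);
}

// Models a[href*="s"]: the attribute value must contain s anywhere.
function matchesHrefSubstring(el, s) {
  return el.href.includes(s);
}

console.log(matchesContains(link, '/en/predictions-tips'));   // false
console.log(matchesHrefPrefix(link, '/en/predictions'));      // true
console.log(matchesHrefSubstring(link, 'predictions-tips'));  // true
```

So when the goal is "the href contains this text" rather than "starts with this text", a[href*="/en/predictions-tips"] is the attribute-selector counterpart of what the :contains attempt was trying to express.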

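Note, however, that on the URL you are using, these selectors match 2 elements. Please keep this in mind. Since .attr('href') returns the value of the first matched element only, the modifications above can be used as they are when you want the 1st value.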

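If you want to retrieve both values, how about the following modification?

Modified script:

In this modification, the other modified selectors above can also be used.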

const url = "https://www.forebet.com/en/teams/wealdstone";
const contentText = UrlFetchApp.fetch(url).getContentText();
const $ = Cheerio.load(contentText);
const scrapurl = $('div.schema > div > div.tnms > div > a.tnmscn[href^="/en/predictions"]'); // 'a.tnmscn[href^="/en/predictions"]' can also be used
$(scrapurl).each(function() {
  const urlmatch = $(this).attr('href');
  console.log(urlmatch);
});

When this script is run, the following result is obtained:

  /en/predictions-tips-wealdstone-solihull-moors-1455115
  /en/predictions-tips-crawley-town-leyton-orient-1474259
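
The retrieved values are site-relative paths. If full URLs are needed, one option is to prefix them with the scheme and host of the page URL. This is a minimal sketch, assuming every href begins with a single / as in the result above; toAbsoluteUrl is a hypothetical helper, not part of Cheeriogs. It uses a regular expression so that it does not depend on a URL class being available in the script runtime.

```javascript
// Hypothetical helper: prefix a site-relative href with the origin of the page it came from.
function toAbsoluteUrl(pageUrl, href) {
  // Capture "scheme://host" from the page URL, e.g. "https://www.forebet.com".
  const origin = pageUrl.match(/^https?:\/\/[^\/]+/);
  if (!origin || !href.startsWith('/') || href.startsWith('//')) {
    // Not a case this sketch handles (already absolute, protocol-relative,
    // or document-relative); return the href unchanged.
    return href;
  }
  return origin[0] + href;
}

const pageUrl = 'https://www.forebet.com/en/teams/wealdstone';
const hrefs = [
  '/en/predictions-tips-wealdstone-solihull-moors-1455115',
  '/en/predictions-tips-crawley-town-leyton-orient-1474259',
];
const absoluteUrls = hrefs.map(function (href) {
  return toAbsoluteUrl(pageUrl, href);
});
console.log(absoluteUrls.join('\n'));
```

In the modified script, the same helper could be called on urlmatch inside the each callback.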
