Is it possible for JavaScript to generate DOM HTML that Cheerio cannot extract?

Question

I am trying to extract the price from this web page: https://www.allbirds.com/products/mens-wool-runner-up-mizzles-natural-grey?size=13

I have narrowed it down to these divs:

<div class="jsx-3947815802 Container">
<div class="jsx-526902087 Grid">
<div class="jsx-2943457050 Grid__cell Grid__cell--small-12 Grid__cell--medium-7 Grid__cell--large-up-8">...

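The jsx-{random_number} class names look suspicious to me. They seem to be generated on the fly. The price I need is inside these divs. However, the divs are not present in the page source, nor in the cheerio object I work with at runtime. They are simply missing.

How common is this technique? It seems like a pretty good way to stop web scrapers. How do I get around it?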

Answer 1

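If those classes are random, that can be annoying, but it is not a deal-breaker, because the other classes appear to stay the same.

For example, the element containing the price looks like this: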

<p class="jsx-3188494938 Paragraph PdpMasterProductDetails__paragraph">5</p>

PdpMasterProductDetails__paragraph does not change, so you can use it as a selector to retrieve the text:

$('.PdpMasterProductDetails__paragraph').text()

You can also retrieve the price from a meta tag:

<meta property="og:price:amount" content="135">

It can be selected with this selector string:

meta[property="og:price:amount"]

Answer 2

How common is this technique?

Very.

Building websites as single-page applications with tools such as React is very common.

It seems like a pretty good way to stop web scrapers.

It isn't.

How do I get around it?

Hit the web service that the React code fetches its raw data from directly. It is easy to find via the Network tab in the browser's developer tools.
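A rough sketch of that approach, assuming Node 18 or later for the built-in fetch. The URL, the response shape, and both function names are hypothetical. Replace them with whatever the Network tab actually shows for the page.

```javascript
// Hypothetical endpoint; use the request URL you see in the Network tab.
const PRODUCT_API_URL = 'https://example.com/api/products/some-product.json';

// Pull the price out of an already-parsed response body.
// The { product: { variants: [{ price }] } } shape is an assumption.
function priceFromPayload(payload) {
  const variant = payload?.product?.variants?.[0];
  return variant ? Number(variant.price) : null;
}

// Request the JSON directly instead of scraping the rendered HTML.
async function fetchPrice(url = PRODUCT_API_URL) {
  const response = await fetch(url, {
    headers: { Accept: 'application/json' },
  });
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return priceFromPayload(await response.json());
}
```

Keeping the parsing in priceFromPayload, separate from the network call, makes it easy to adjust once you know the real response format.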

