Does robots.txt prevent humans from gathering data?

Question

I understand that robots.txt is a file intended for "robots", or should I say "automated crawlers". But does it prevent a human from typing in a "forbidden" page and gathering the data by hand?

Maybe an example makes it clearer: I cannot crawl this page:

https://www.drivy.com/search?address=Gare+de+Li%C3%A8ge-Guillemins&address_source=&poi_id=&latitude=50.6251&longitude=5.5659&city_display_name=&start_date=2019-04-06&start_time=06%3A00&end_date=2019-04-07&end_time=06%3A00&country_scope=BE

Can I still retrieve the JSON file containing the data "manually", through my web browser's developer tools?

Answer 1

robots.txt files are guidelines; they do not prevent anyone, human or machine, from accessing any content.

The default settings.py file generated for a Scrapy project sets ROBOTSTXT_OBEY to True. You can set it to False if you wish.
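
As a minimal sketch, the relevant line in a Scrapy project's settings.py would look like the following. The rest of the generated file is omitted here, and whether disabling the check is appropriate depends on the site and its terms of use.

```python
# settings.py of a Scrapy project (only the relevant setting is shown)

# Generated projects set this to True, which makes Scrapy download and
# respect each site's robots.txt. Setting it to False disables that check.
ROBOTSTXT_OBEY = False
```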

Mind that websites may nonetheless employ anti-scraping measures to prevent you from scraping those pages. But that is a whole other topic.

Answer 2

According to the original robots.txt specification from 1994, the rules in robots.txt only target robots (bold emphasis mine):

WWW Robots (also called wanderers or spiders) are programs that traverse many pages in the World Wide Web by recursively retrieving linked pages.

[…]

These incidents indicated the need for established mechanisms for WWW servers to indicate to robots which parts of their server should not be accessed.

So robots are programs that automatically retrieve documents linked/referenced in other documents.

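If a human retrieves a document (using a browser or some other program), or if a human feeds a list of manually collected URLs to some program (and the program doesn't add/follow references found in the retrieved documents), the rules in robots.txt don't apply.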

The FAQ "What is a WWW robot?" confirms this:

Normal Web browsers are not robots, because they are operated by a human, and don't automatically retrieve referenced documents (other than inline images).
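
To illustrate what "obeying" robots.txt means for a program that does qualify as a robot, here is a rough sketch using Python's standard urllib.robotparser module. The rules, the example.com URLs, the "example-bot" user agent, and the allowed() helper are all made up for illustration; they are not the actual robots.txt of any site. The check is something a well-behaved crawler performs voluntarily before fetching a page; nothing on the server side enforces it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only, not taken from a real site.
rules = """\
User-agent: *
Disallow: /search
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)  # parse the rules directly instead of fetching them


def allowed(url, user_agent="example-bot"):
    """Return True if the parsed rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


# A polite crawler would skip the first URL and fetch the second.
print(allowed("https://www.example.com/search?address=foo"))
print(allowed("https://www.example.com/about"))
```

A human opening the same /search URL in a browser never runs such a check, which is consistent with the FAQ's statement that normal browsers are not robots.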

