What does this robots.txt mean?

Question

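There is a website I need to scrape. I have no commercial purpose; it is only for learning.

I looked at its robots.txt, and it contains the following: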

User-agent: *

Allow: /

Disallow: /*.notfound.html

Can I scrape this site with requests and BeautifulSoup?

I found that a request sent without headers results in a 403 error. Does that mean scraping is not allowed?

Answer 1

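Status code 403 is a client-side error: the server understood the request but refused to serve it. An error of that type says nothing about the site's scraping policy, so by itself it does not mean the website forbids extracting data. To get rid of the 403 you have to send something extra with the request, such as headers. Most of the time, though not always, sending a User-Agent header is enough to solve the problem. Here is an example of how to do that with the requests module and BeautifulSoup.

Injecting a User-Agent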

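import requests
from bs4 import BeautifulSoup

# Browser-like User-Agent; some servers answer 403 to the default one that requests sends.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"
}

# Replace "Your url" with the page you want to fetch.
response = requests.get("Your url", headers=headers)
print(response.status_code)

# Once the request succeeds, parse the HTML (the "lxml" parser must be installed):
# soup = BeautifulSoup(response.content, "lxml")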
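
As for the robots.txt itself: `User-agent: *` addresses every crawler, `Allow: /` permits the whole site, and the `Disallow` line excludes only URLs matching `/*.notfound.html`. A minimal sketch of checking a URL against those rules with the standard library's `urllib.robotparser` follows. The `is_allowed` helper and the `example.com` URL are hypothetical names used only for illustration. One caveat: as far as I know, the standard-library parser does plain prefix matching and applies the first rule that matches, so it does not interpret the `*` wildcard in the `Disallow` line the way the major search-engine crawlers do. Treat its answer for `.notfound.html` URLs with caution and simply avoid those pages.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted in the question, one directive per line.
ROBOTS_LINES = [
    "User-agent: *",
    "Allow: /",
    "Disallow: /*.notfound.html",
]


def is_allowed(url, user_agent="*"):
    """Return True if the quoted rules let user_agent fetch url (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_LINES)  # parse the in-memory rules; no network access
    return parser.can_fetch(user_agent, url)


# example.com is a placeholder for the real site.
print(is_allowed("https://example.com/some/page.html"))
```

For a live site you could instead call `set_url()` with the address of its robots.txt and then `read()`, and check each URL before requesting it.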


python

robots.txt

beautifulsoup