Python request to crawl URL returns 404 Error while working inside the browser
I have a Python crawling script that chokes on this URL: pulsepoint.com/sellers.json
The bot fetches the content with a standard requests call but gets a 404 back. In a browser the same URL works fine (there is a 301 redirect, but requests can follow it). My first hunch was a request-header problem, so I copied my browser configuration. The code looks like this:
import logging
import requests

crawled_url = "pulsepoint.com"
seller_json_url = 'http://{thehost}/sellers.json'.format(thehost=crawled_url)
print(seller_json_url)

myheaders = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache'
}

r = requests.get(seller_json_url, headers=myheaders)
logging.info(" %d" % r.status_code)
But I still get a 404.
My next guesses:
- A login? Not used here.
- Cookies? Not that I can see.
So how is their server blocking my bot? This is a URL that is meant to be crawled, there is nothing illegal here...
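One way to narrow this down is to print the redirect chain that requests actually follows, so you can see where the 301 lands and which hop produces the 404; a minimal diagnostic sketch:

import requests

seller_json_url = 'http://pulsepoint.com/sellers.json'
r = requests.get(seller_json_url, allow_redirects=True, timeout=10)
for hop in r.history:
    print(hop.status_code, hop.url)   # each redirect hop requests followed
print(r.status_code, r.url)           # final status code and final URL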
Thanks in advance!
You can go to the link directly and extract the data, without needing the 301 to get you to the correct link:
import requests

headers = {"Upgrade-Insecure-Requests": "1"}
# Hit the redirect destination directly; verify=False skips the failing
# SSL certificate check on this host
response = requests.get(
    url="https://projects.contextweb.com/sellersjson/sellers.json",
    headers=headers,
    verify=False,
)
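From there the payload can be parsed straight away; a minimal usage sketch, assuming the body is the JSON shown in the example response below:

data = response.json()
print(data['contact_email'])              # field from the example response
print(len(data['sellers']), 'sellers')    # list of seller records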
You can also work around the SSL certificate error, like this:
from urllib.request import urlopen
import ssl
import json
# Workaround for the SSL certificate error: disable certificate verification
ssl._create_default_https_context = ssl._create_unverified_context
crawled_url = "pulsepoint.com"
seller_json_url = 'http://{thehost}/sellers.json'.format(thehost=crawled_url)
print(seller_json_url)
response = urlopen(seller_json_url).read()
# print in dictionary format
print(json.loads(response))
Example response:
{'contact_email': 'PublisherSupport@pulsepoint.com', 'contact_address': '360 Madison Ave, 14th Floor, NY, NY, 10017', 'version': '1.0', 'identifiers': [{'name': 'TAG-ID', 'value': '89ff185a4c4e857c'}], 'sellers': [{'seller_id': '508738', ... 'seller_type': 'PUBLISHER'}, {'seller_id': '562225', 'name': 'EL DIARIO', 'domain': 'impremedia.com', 'seller_type': 'PUBLISHER'}]}
OK, just for anyone else, a hardened version of âńōŋŷXmoůŜ's answer, because:
- some sites want headers in order to answer;
- some sites use unusual encodings;
- some sites send gzipped answers even when not asked to.
import urllib.request
import ssl
import json
from io import BytesIO
import gzip

# Workaround for the SSL certificate error: disable certificate verification
ssl._create_default_https_context = ssl._create_unverified_context

crawled_url = "pulsepoint.com"
seller_json_url = 'http://{thehost}/sellers.json'.format(thehost=crawled_url)

req = urllib.request.Request(seller_json_url)
# ADDING THE HEADERS
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0')
req.add_header('Accept', 'application/json,text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8')

response = urllib.request.urlopen(req)
data = response.read()

# IN CASE THE ANSWER IS GZIPPED
if response.info().get('Content-Encoding') == 'gzip':
    buf = BytesIO(data)
    f = gzip.GzipFile(fileobj=buf)
    data = f.read()

# ADAPTS THE ENCODING TO THE ANSWER
print(json.loads(data.decode(response.info().get_param('charset') or 'utf-8')))
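For comparison, requests does the gzip decompression and charset decoding automatically. A minimal sketch that targets the redirect destination directly, as in the first answer (verify=False is the same SSL workaround as above):

import requests

url = 'https://projects.contextweb.com/sellersjson/sellers.json'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0',
    'Accept': 'application/json,text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
}
# requests transparently handles gzip Content-Encoding and the declared charset
r = requests.get(url, headers=headers, verify=False, timeout=10)
print(r.json())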
Thanks again!