Access Denied: You don't have permission to access "http://www.airbnb.ca/rooms/48058366/" on this server
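Is there any way to resolve this error? I am using Splash to fetch the HTML, but the response.body that comes back is an Access Denied page. I can see the data in Chrome developer tools, but because of this error I cannot extract the HTML. Also, when I use Splash on its own, I see the full HTML. Here is my GitHub link for anyone who wants to look: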
https://github.com/ryanshrott/scraping/tree/master/demo_airbnb
The returned response.body reads (markup stripped):
Access Denied
You don't have permission to access "http://www.airbnb.ca/rooms/48058366/" on this server.
Reference #18.66cc94d1.1643648347.66b47664
import scrapy
from scrapy_splash import SplashRequest

class SimpleSpider(scrapy.Spider):
    name = 'simple'
    allowed_domains = ['airbnb.ca']
    # Lua script for Splash: load the page, wait briefly, return the HTML
    script = '''function main(splash, args)
        assert(splash:go(args.url))
        assert(splash:wait(0.5))
        return {
            html = splash:html(),
        }
    end'''
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                             'AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/75.0.3770.80 Safari/537.36'}

    def start_requests(self):
        yield SplashRequest(
            url='https://www.airbnb.ca/rooms/48058366/',
            callback=self.parse,
            args={"lua_source": self.script},
            headers=self.headers
        )

    def parse(self, response):
        yield {'body': response.body,
               'title': response.xpath("//h2[@class='_14i3z6h']/text()").get()}
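When using a Lua script, you need to send the request to the execute endpoint, as shown in the code below. In addition, when using scrapy_splash, be sure to include the required values in the settings.py file or in the spider's custom_settings attribute, as follows: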
import json
import scrapy
from scrapy_splash import SplashRequest

class SimpleSpider(scrapy.Spider):
    name = 'simple'
    allowed_domains = ['airbnb.ca']
    # scrapy_splash settings, applied to this spider only
    custom_settings = dict(
        SPLASH_URL='http://localhost:8050',
        DOWNLOADER_MIDDLEWARES={
            'scrapy_splash.SplashCookiesMiddleware': 723,
            'scrapy_splash.SplashMiddleware': 725,
            'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        },
        SPIDER_MIDDLEWARES={
            'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
        },
        DUPEFILTER_CLASS='scrapy_splash.SplashAwareDupeFilter',
    )
    # Lua script for Splash: load the page, wait briefly, return the HTML string
    script = '''function main(splash, args)
        assert(splash:go(args.url))
        assert(splash:wait(0.5))
        return splash:html()
    end'''
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                             'AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/75.0.3770.80 Safari/537.36'}

    def start_requests(self):
        yield SplashRequest(
            url='https://www.airbnb.ca/rooms/48058366/',
            callback=self.parse,
            args={"lua_source": self.script},
            endpoint='execute',  # required for the Lua script to run
            headers=self.headers
        )

    def parse(self, response):
        # The listing data is embedded as JSON in the data-deferred-state element
        data = response.xpath("//*[@id='data-deferred-state']/text()").get()
        yield json.loads(data)
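One thing to watch in parse: if the page comes back as Access Denied again, the XPath finds nothing, `.get()` returns None, and `json.loads(None)` raises a TypeError. A minimal sketch of a guard, assuming you would rather skip such responses than crash (the helper name `load_deferred_state` is hypothetical, not part of Scrapy or scrapy_splash):

```python
import json


def load_deferred_state(raw):
    """Return the decoded JSON, or None if raw is missing or not valid JSON."""
    if not raw:
        # .get() returned None or an empty string: the element was not found
        return None
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # The text was present but is not JSON (e.g. an error page)
        return None
```

In parse you could then call `load_deferred_state(data)` and only yield when the result is not None.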
If you run the spider with scrapy crawl simple or scrapy runspider simple.py, it yields the JSON parsed from the data-deferred-state element as its output.
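If you prefer the settings.py route mentioned above over custom_settings, the same values can go there as module-level names. This is only a sketch using the values from the spider; it assumes Splash is listening on localhost:8050, so adjust SPLASH_URL to wherever your instance runs:

```python
# settings.py (project-wide equivalent of the spider's custom_settings)

# Assumption: Splash is running locally on its default port
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```

With these in settings.py they apply to every spider in the project, so the custom_settings dict in the spider can be dropped.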