Web Scraping Error - ERROR for site owner: Invalid domain for site key
I am trying to fetch the contents of this URL - https://www.zillow.com/homedetails/131-Avenida-Dr-Berkeley-CA-94708/24844204_zpid/
I am using Scrapy. Here is my code:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://www.zillow.com/homedetails/131-Avenida-Dr-Berkeley-CA-94708/24844204_zpid/',
    ]

    def parse(self, response):
        filename = 'test.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
I opened the scraped data (test.html) and this is the content I got.
I tried to find a solution and tried this -
but it did not solve my problem.
First, try this approach and see if it works:
import scrapy

# Browser-like headers; without these, the site serves a bot-detection page
# instead of the listing.
Headerz = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "accept-encoding": "gzip, deflate, br",
    "accept-language": "en-US,en;q=0.9",
    "cache-control": "no-cache",
    "content-type": "application/x-www-form-urlencoded; charset=UTF-8",
    "pragma": "no-cache",
    "sec-fetch-mode": "navigate",
    "sec-fetch-site": "cross-site",
    "sec-fetch-user": "?1",
    "upgrade-insecure-requests": "1",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36",
}

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://www.zillow.com/homedetails/131-Avenida-Dr-Berkeley-CA-94708/24844204_zpid/',
    ]

    def start_requests(self):
        # start_urls is a class attribute, so it must be accessed via self
        yield scrapy.Request(self.start_urls[0], callback=self.parse, headers=Headerz)

    def parse(self, response):
        filename = 'test.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
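Assuming the spider is saved in a file such as zillow_spider.py (the filename is just an example), you can run it standalone with:

scrapy runspider zillow_spider.py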
The reason we don't see the output we would see in a normal browser is that we are not sending the right headers, which a browser always sends. You need to add the headers as in the code above, or update them in settings.py.
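For the settings.py route, a minimal sketch using Scrapy's built-in USER_AGENT and DEFAULT_REQUEST_HEADERS settings (the values below are just copied from the headers above, not a recommendation):

# settings.py - apply browser-like headers to every request
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"
DEFAULT_REQUEST_HEADERS = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
    "upgrade-insecure-requests": "1",
}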
An even better approach is to also use the 'rotating-proxies' and 'rotating-user-agent' packages.
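As a sketch, assuming these refer to the scrapy-rotating-proxies and scrapy-user-agents packages (that mapping is my assumption), the settings.py wiring would look roughly like this; the proxy addresses are placeholders:

# settings.py - rotate proxies and user agents
ROTATING_PROXY_LIST = [
    # placeholder proxies - replace with real ones
    "proxy1.example.com:8000",
    "proxy2.example.com:8031",
]
DOWNLOADER_MIDDLEWARES = {
    "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
    "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
    # disable the built-in user-agent middleware and pick a random one per request
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    "scrapy_user_agents.middlewares.RandomUserAgentMiddleware": 400,
}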