response.css html on Scrapy
I am trying to extract the HTML link behind the "Website" button on this page:
https://www.tripadvisor.com/Restaurant_Review-g982688-d1140717-Reviews-Strandkanten-Karlskoga_Orebro_County.html
Screenshot: https://monosnap.com/file/agSNP29XoLDlG4HZtntaaifAtFPzcH
I tried response.css('a.dOGcA::attr(href)').extract()
but it returns an empty result.
What am I doing wrong?
Thanks!
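The URL you want to scrape is loaded dynamically. If you disable JavaScript in your browser, you will see that the url/href disappears from the HTML DOM, so the plain Scrapy response never contains it. That is why I used SeleniumRequest with Scrapy, which renders the page in a real browser, and got the desired output.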
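One way to confirm that diagnosis is to check whether the anchor exists in the HTML the server actually sends, before any JavaScript runs. The sketch below uses only the standard library; `AnchorFinder`, `find_hrefs`, and both HTML snippets are hypothetical stand-ins, not the real TripAdvisor markup. In practice you would pass in `response.text` from `scrapy shell` instead.

```python
from html.parser import HTMLParser


class AnchorFinder(HTMLParser):
    """Collect href values of <a> tags that carry a given CSS class token."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # The class attribute is a space-separated token list; it may be missing.
        classes = (attr_map.get("class") or "").split()
        if self.css_class in classes and attr_map.get("href"):
            self.hrefs.append(attr_map["href"])


def find_hrefs(html, css_class):
    """Return every href of an <a> element whose class list contains css_class."""
    finder = AnchorFinder(css_class)
    finder.feed(html)
    finder.close()
    return finder.hrefs


# Hypothetical snippets: what the server might send vs. what the browser
# shows after JavaScript has inserted the link.
raw_html = '<div id="website-slot"></div>'
rendered_html = (
    '<div id="website-slot">'
    '<a class="dOGcA Ci Wc" href="http://www.strandkanten.nu">Website</a>'
    '</div>'
)

print(find_hrefs(raw_html, "dOGcA"))       # []
print(find_hrefs(rendered_html, "dOGcA"))  # ['http://www.strandkanten.nu']
```

If the first call comes back empty for the real page body, the selector is not the problem: the element simply is not in the downloaded HTML, and a rendering step is needed.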
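Code:
import scrapy
from scrapy_selenium import SeleniumRequest


class LinkSpider(scrapy.Spider):
    name = 'link'

    def start_requests(self):
        url = 'https://www.tripadvisor.com/Restaurant_Review-g982688-d1140717-Reviews-Strandkanten-Karlskoga_Orebro_County.html'
        # Fetch the page through Selenium so the JavaScript-inserted link exists.
        yield SeleniumRequest(
            url=url,
            wait_time=3,
            callback=self.parse)

    def parse(self, response):
        # Exact match on the full class attribute as rendered when this was written.
        yield {'Link': response.xpath('//a[@class="dOGcA Ci Wc _S C fhGHT"]/@href').get()}

    # No spider_closed hook is needed: the spider has no self.driver attribute,
    # and the scrapy-selenium middleware quits the browser when the spider closes.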
Output:
2021-10-24 01:25:39 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.com/Restaurant_Review-g982688-d1140717-Reviews-Strandkanten-Karlskoga_Orebro_County.html>
{'Link': 'http://www.strandkanten.nu'}
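SeleniumRequest only works once the scrapy-selenium downloader middleware is enabled in the project's settings.py, which the answer does not show. A minimal sketch follows, assuming Chrome with chromedriver on the PATH and a scrapy-selenium release compatible with the installed Selenium version; the browser choice and the headless flag are assumptions, not part of the original answer.

```python
# settings.py (sketch; driver choice and arguments are assumptions)
from shutil import which

# Which browser scrapy-selenium should drive, and where its driver binary lives.
SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')  # None if not on PATH
SELENIUM_DRIVER_ARGUMENTS = ['--headless']  # run without opening a window

# Route requests through the Selenium middleware so SeleniumRequest is rendered.
DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800,
}
```

With these settings in place, the spider can be run with `scrapy crawl link`.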