Scrapy / Use scrapy-selenium for the first request page?
I have a working solution that uses scrapy_selenium for sites that load their content with JavaScript. As you can see in the code below, a SeleniumRequest is used when yielding the detail-page requests handled by parseDetails. But what do I do when I need a SeleniumRequest already for my start page, not just for the detail pages below it? How can I use SeleniumRequest in that case?
import scrapy
from scrapy_selenium import SeleniumRequest

class ZoosSpider(scrapy.Spider):
    name = 'zoos'
    allowed_domains = ['www.tripadvisor.co.uk']
    start_urls = [
        "https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c53-a_allAttractions.true-United_Kingdom.html"
    ]
    existList = []

    def parse(self, response):
        tmpSEC = response.xpath("//section[@data-automation='AppPresentation_SingleFlexCardSection']")
        for elem in tmpSEC:
            link = response.urljoin(elem.xpath(".//a/@href").get())
            yield SeleniumRequest(
                url=link,
                wait_time=10,
                callback=self.parseDetails)

    def parseDetails(self, response):
        tmpName = response.xpath("//h1[@data-automation='mainH1']/text()").get()
        tmpLink = response.xpath("//div[@class='Lvkmj']/a/@href").getall()
        tmpURL = tmpTelnr = tmpMail = "N/A"
        yield {
            "Name": tmpName,
            "URL": tmpURL,
        }
You can override start_requests() in your spider to issue the first requests yourself:
class ZoosSpider(scrapy.Spider):
    def start_requests(self):
        for link in self.start_urls:
            yield SeleniumRequest(
                url=link,
                wait_time=10,
                callback=self.parse)
See the first point in the documentation: Spider
The first requests to perform are obtained by calling the start_requests() method
which (by default) generates Request for the URLs specified in the start_urls
and the parse method as callback function for the Requests.
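The quoted behaviour can be illustrated with a minimal pure-Python sketch. Note these Request, SeleniumRequest, and Spider classes are simplified stand-ins, not the real Scrapy / scrapy_selenium classes; the point is only that overriding start_requests() swaps the request type used for start_urls.

```python
class Request:
    """Stand-in for scrapy.Request."""
    def __init__(self, url, callback=None):
        self.url = url
        self.callback = callback

class SeleniumRequest(Request):
    """Stand-in for scrapy_selenium.SeleniumRequest."""
    def __init__(self, url, callback=None, wait_time=0):
        super().__init__(url, callback)
        self.wait_time = wait_time

class Spider:
    """Stand-in for scrapy.Spider, showing the default behaviour."""
    start_urls = []

    def start_requests(self):
        # Default: a plain Request per start URL, parse as callback.
        for url in self.start_urls:
            yield Request(url, callback=self.parse)

    def parse(self, response):
        pass

class ZoosSpider(Spider):
    start_urls = [
        "https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c53-a_allAttractions.true-United_Kingdom.html"
    ]

    def start_requests(self):
        # Override: the very first requests are now SeleniumRequests.
        for url in self.start_urls:
            yield SeleniumRequest(url=url, wait_time=10, callback=self.parse)

requests = list(ZoosSpider().start_requests())
print(type(requests[0]).__name__)  # SeleniumRequest
```

Without the override, the same spider would emit plain Request objects, which is why the start page never went through Selenium.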