How to scrape the full list on a page loaded with infinite scrolling, where the requested URL is the same every time
I am still working on my first few Scrapy projects, and I have come across a site that uses infinite scrolling where the requested URL is the same every time. I tried looking for solutions, but all the material I have read deals with URLs that differ in some way (a page number, text, etc.). How can I extract all the names that appear on https://www.baincapital.com/people? I have already figured out my selectors and so on, but the spider only returns the information that is initially visible. Any help would be much appreciated.
My code so far:
import scrapy
from scrapy_splash import SplashRequest

class BainPeople(scrapy.Spider):
    name = 'BainPeop'
    start_urls = ['https://www.baincapital.com/people']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url, callback=self.parse, args={'wait': 3})

    def parse(self, response):
        name = response.css('h4 span::text').extract()
        links = response.css('div.col-xs-6.col-sm-4.col-md-6.col-lg-3.grid.staff a::attr(href)').extract()
        yield {'name': name}
Updated code:
import scrapy
from selenium import webdriver

class BainpeopleSpider(scrapy.Spider):
    name = 'bainpeople'
    allowed_domains = ['baincapital.com']
    start_urls = ['http://www.baincapital.com/people/']

    def parse(self, response):
        driver = webdriver.Chrome(executable_path='C:/Users/uchit.madhok/Downloads/chromedriver_win32/chromedriver')
        driver.get('http://www.baincapital.com/people/')
        # find_elements_* returns a list of WebElements, which has no .text
        # or .attr shortcut; read each element individually instead.
        name = [el.text for el in driver.find_elements_by_css_selector('h4 span')]
        links = [el.get_attribute('href') for el in driver.find_elements_by_css_selector('div.col-xs-6.col-sm-4.col-md-6.col-lg-3.grid.staff a')]
        yield {
            'name': name,
            'links': links
        }
        driver.close()
Final code:
import scrapy
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

class BainpeopleSpider(scrapy.Spider):
    name = 'bainpeople'
    allowed_domains = ['baincapital.com']
    start_urls = ['http://www.baincapital.com/people/']

    def parse(self, response):
        browser = webdriver.Chrome(executable_path='C:/Users/uchit.madhok/Downloads/chromedriver_win32/chromedriver')
        browser.get('http://www.baincapital.com/people/')

        # Press End repeatedly so the page keeps loading more profiles.
        elm = browser.find_element_by_tag_name('html')
        i = 30
        while i > 0:
            elm.send_keys(Keys.END)
            time.sleep(8)
            elm.send_keys(Keys.HOME)
            i = i - 1

        links = [a.get_attribute('href') for a in browser.find_elements_by_css_selector('div.col-xs-6.col-sm-4.col-md-6.col-lg-3.grid.staff a')]
        for j in links:
            yield response.follow(str(j), callback=self.parse_detail)

    def parse_detail(self, response):
        name = response.css('h1.pageTitle::text').extract()
        title = response.css('div.__location::text')[0].extract()
        team = response.css('div.__location::text')[1].extract()
        location = response.css('div.__location::text')[2].extract()
        about = response.css('div.field-item.even p::text').extract()
        sector = response.css('ul.focus_link a::text').extract()
        yield {
            'name': name,
            'title': title,
            'team': team,
            'location': location,
            'about': about,
            'sector': sector
        }
What you are trying to do is probably not possible with Scrapy alone. Accessing dynamically loaded data is a well-known problem, but fortunately there are solutions. One of them is Selenium. Here you can see how to use it to access dynamic data on a page and how to integrate it with Scrapy: selenium with scrapy for dynamic page
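One refinement over a fixed number of scroll iterations (the `i = 30` loop above) is to keep scrolling until the page height stops growing, which adapts to however many profiles the site actually has. Below is a minimal sketch of that loop, written against generic callables so the idea is clear; the Selenium wiring shown in the comment (using `execute_script` with `document.body.scrollHeight`) is an assumption about how you would hook it up, not code from the question:

```python
import time

def scroll_until_stable(get_height, scroll_to_bottom, pause=2.0, max_rounds=50):
    """Scroll repeatedly until the page height stops growing.

    get_height and scroll_to_bottom are callables, so the loop can be
    driven by any browser-automation tool. Returns the final height.
    """
    last = get_height()
    for _ in range(max_rounds):
        scroll_to_bottom()
        time.sleep(pause)       # give the lazy-loaded content time to arrive
        new = get_height()
        if new == last:         # nothing new was loaded: we reached the end
            return new
        last = new
    return last

# Hypothetical Selenium wiring (requires a live webdriver instance):
# scroll_until_stable(
#     lambda: browser.execute_script("return document.body.scrollHeight"),
#     lambda: browser.execute_script("window.scrollTo(0, document.body.scrollHeight)"),
# )
```

This avoids both under-scrolling (missing profiles) and the wasted `30 * 8` seconds of sleeping when the list is short.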