Web scraping with pagination doesn't return all results
I am trying to scrape Indeed.com but I am running into a problem with pagination.
Here is my code:
import scrapy

class JobsNySpider(scrapy.Spider):
    name = 'jobs_ny'
    allowed_domains = ['www.indeed.com']
    start_urls = ['https://www.indeed.com/jobs?q=analytics&l=New%20York%2C%20NY&vjk=7b2f6385304ffc78']

    def parse(self, response):
        jobs = response.xpath("//td[@id='resultsCol']")
        for job in jobs:
            yield {
                'Job_title': job.xpath(".//td[@class='resultContent']/div/h2/span/text()").get(),
                'Company_name': job.xpath(".//span[@class='companyName']/a/text()").get(),
                'Company_rating': job.xpath(".//span[@class='ratingNumber']/span/text()").get(),
                'Company_location': job.xpath(".//div[@class='companyLocation']/text()").get(),
                'Estimated_salary': job.xpath(".//span[@class='estimated-salary']/span/text()").get()
            }

        next_page = response.urljoin(response.xpath("//a[@aria-label='Next']/@href").get())
        if next_page:
            yield scrapy.Request(url=next_page, callback=self.parse)
The problem is that, according to Indeed, 28,789 jobs match my query.
However, when I save the scraped output to a csv file, it contains only 76 rows.
I also tried:
next_page = response.urljoin(response.xpath("//ul[@class='pagination-list']/li[position() = last()]/a/ @href").get())
but the result was similar.
So my question is: what am I doing wrong with the pagination?
- The problem is not the pagination; it is that you only get one job from each page.
- It is also better to do the urljoin after the if statement, so nothing goes wrong when there is no next page link.

With both changes applied:
import scrapy

class JobsNySpider(scrapy.Spider):
    name = 'jobs_ny'
    allowed_domains = ['www.indeed.com']
    start_urls = ['https://www.indeed.com/jobs?q=analytics&l=New%20York%2C%20NY&vjk=7b2f6385304ffc78']

    def parse(self, response):
        # Select each individual job card, not the whole results column
        jobs = response.xpath('//div[@id="mosaic-provider-jobcards"]/a')
        for job in jobs:
            yield {
                'Job_title': job.xpath(".//td[@class='resultContent']/div/h2/span/text()").get(),
                'Company_name': job.xpath(".//span[@class='companyName']/a/text()").get(),
                'Company_rating': job.xpath(".//span[@class='ratingNumber']/span/text()").get(),
                'Company_location': job.xpath(".//div[@class='companyLocation']/text()").get(),
                'Estimated_salary': job.xpath(".//span[@class='estimated-salary']/span/text()").get()
            }

        # Only build the absolute URL once we know a "Next" link exists
        next_page = response.xpath("//a[@aria-label='Next']/@href").get()
        if next_page:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(url=next_page, callback=self.parse)
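
Side note on the CSV export: if you are not running this inside a full Scrapy project, a small runner script is enough to check the row count. This is a minimal sketch, assuming the JobsNySpider class above is defined in the same file; the jobs.csv path is just an example:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    "FEEDS": {
        "jobs.csv": {"format": "csv"},  # each yielded item becomes one CSV row
    },
})
process.crawl(JobsNySpider)  # the spider class from the snippet above
process.start()              # blocks until the crawl finishes

Inside a project, running scrapy crawl jobs_ny -o jobs.csv produces the same file.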