How to scrape the next page's items
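Hello, I am new to programming and to Scrapy. While learning Scrapy I tried to scrape some items, but I can't scrape the items on the next page. Please help me work out how to parse the next-page link URL for this website.
Here is my code: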
import scrapy
from scrapy.linkextractors import LinkExtractor


class BdJobs(scrapy.Spider):
    name = 'jobs'
    allowed_domains = ['Jobs.com']
    start_urls = [
        'http://jobs.com/',
    ]
    #rules=( Rule(LinkExtractor(allow()), callback='parse', follow=True))

    def parse(self, response):
        for title in response.xpath('//div[@class="job-title-text"]/a'):
            yield {
                'titles': title.xpath('./text()').extract()[0].strip()
            }
nextPageLink:
To grab the next URL, here is a screenshot of the inspected element:
https://08733078838609164420.googlegroups.com/attach/58c611bdb536b/bdjobs.png?part=0.1&view=1&vt=ANaJVrEDQr4PODzoOkFRO_fLhL2ZF3x-Mts4XJ8m8qb2RSX1b4n6kv0E-62A2yvw0HkBjrmUOwCrFpMBk_h8UYSWDO6hZXyt-N2brbcYwtltG-A6NiHeaGc
Here is the output:
{"titles": "Senior Software Engineer (.Net)"},
{"titles": "Java programmer"},
{"titles": "VLSI Design Engineer (Japan)"},
{"titles": "Assistant Executive (Computer Lab-Evening programs)"},
{"titles": "IT Officer, Business System Management"},
{"titles": "Executive, IT"},
{"titles": "Officer, IT"},
{"titles": "Laravel PHP Developer"},
{"titles": "Executive - IT (EDISON Footwear)"},
{"titles": "Software Engineer (PHP/ MySQL)"},
{"titles": "Software Engineer [Back End]"},
{"titles": "Full Stack Developer"},
{"titles": "Mobile Application Developer (iOS/ Android)"},
{"titles": "Head of IT Security Operations"},
{"titles": "Database Administrator, Senior Analyst"},
{"titles": "Infrastructure Delivery Senior Analyst, Network Security"},
{"titles": "Head of IT Support Operations"},
{"titles": "Hardware Engineer"},
{"titles": "JavaScript/ Coffee Script Programmer"},
{"titles": "Trainer - Auto CAD"},
{"titles": "ASSISTENT PRODUCTION OFFICER"},
{"titles": "Customer Relationship Executive"},
{"titles": "Head of Sales"},
{"titles": "Sample Master"},
{"titles": "Manager/ AGM (Finance & Accounts)"},
{"titles": "Night Aiditor"},
{"titles": "Officer- Poultry"},
{"titles": "Business Analyst"},
{"titles": "Sr. Executive - Sales & Marketing (Sewing Thread)"},
{"titles": "Civil Engineer"},
{"titles": "Executive Director-HR"},
{"titles": "Sr. Executive (MIS & Internal Audit)"},
{"titles": "Manager, Health & Safety"},
{"titles": "Computer Engineer (Diploma)"},
{"titles": "Sr. Manager/ Manager, Procurement"},
{"titles": "Specialist, Content"},
{"titles": "Manager, Warranty and Maintenance"},
{"titles": "Asst. Manager - Compliance"},
{"titles": "Officer/Sr. Officer/Asst. Manager (Store)"},
{"titles": "Manager, Maintenance (Sewing)"}
Don't use start_urls; it is confusing.
Use the start_requests function instead. It is called as soon as the spider starts.
import scrapy
from scrapy import Request


class BdJobs(scrapy.Spider):
    name = 'bdjobs'
    # Lowercase, so it matches the hostname of the requests below.
    allowed_domains = ['bdjobs.com']

    def start_requests(self):
        urls = ['http://jobs.bdjobs.com/', 'http://jobs.bdjobs.com/jobsearch.asp?fcatId=8&icatId=']
        for url in urls:
            yield Request(url, self.parse_detail_page)

    def parse_detail_page(self, response):
        for title in response.xpath('//div[@class="job-title-text"]/a'):
            yield {
                'titles': title.xpath('./text()').extract()[0].strip()
            }

        # TODO: extract the next-page link from the response here.
        nextPageLink = None
        if nextPageLink:
            yield Request(response.urljoin(nextPageLink), self.parse_detail_page)
Note that you have to extract the next-page link into nextPageLink yourself.