Scrapy cannot scrape a second page using ItemLoader
UPDATE, 7/29, 9:29 pm: After some reading, I updated my code.
UPDATE, July 28, 2015, 7:35 pm: Following Martin's suggestion, the message changed, but there is still no item list and nothing written to the database.
ORIGINAL: I can successfully scrape a single page (the base page). Now I am trying to use a Request with a callback to scrape one of the fields from another URL found on the "base" page, but it does not work. Here is the spider:
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy import Request
import re
from datetime import datetime, timedelta
from CAPjobs.items import CAPjobsItem
from CAPjobs.items import CAPjobsItemLoader
from scrapy.contrib.loader.processor import MapCompose, Join

class CAPjobSpider(Spider):
    name = "naturejob3"
    download_delay = 2
    #allowed_domains = ["nature.com/naturejobs/"]
    start_urls = [
        "http://www.nature.com/naturejobs/science/jobs?utf8=%E2%9C%93&q=pathologist&where=&commit=Find+Jobs"]

    def parse_subpage(self, response):
        il = response.meta['il']
        il.add_xpath('loc_pj', '//div[@id="extranav"]/div/dl/dd[2]/ul/li/text()')
        yield il.load_item()

    def parse(self, response):
        hxs = Selector(response)
        sites = hxs.xpath('//div[@class="job-details"]')
        for site in sites:
            il = CAPjobsItemLoader(CAPjobsItem(), selector=site)
            il.add_xpath('title', 'h3/a/text()')
            il.add_xpath('post_date', 'normalize-space(ul/li[@class="when"]/text())')
            il.add_xpath('web_url', 'concat("http://www.nature.com", h3/a/@href)')
            url = il.get_output_value('web_url')
            yield Request(url, meta={'il': il}, callback=self.parse_subpage)
Now the scraping partly works, but the loc_pj item is missing (updated July 29, 7:35 pm):
2015-07-29 21:28:24 [scrapy] DEBUG: Scraped from <200 http://www.nature.com/naturejobs/science/jobs/535683-assistant-associate-full-hs-clinical-clin-x-anatomic-pathology-cytopathology-11-000>
{'post_date': u'21 days ago',
'title': u'Assistant, Associate, Full (HS Clinical, Clin X) - Anatomic Pathology/Cytopathology (11-000)',
'web_url': u'http://www.nature.com/naturejobs/science/jobs/535683-assistant-associate-full-hs-clinical-clin-x-anatomic-pathology-cytopathology-11-000'}
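The hand-off in the spider above, building a loader in `parse`, carrying it to `parse_subpage` through `Request.meta`, and finishing it there, can be sketched in plain Python with no Scrapy install; every name below is a hypothetical stand-in for the corresponding Scrapy object:

```python
# Plain-Python sketch of the Request.meta hand-off used in the spider:
# parse() builds partial state, attaches it to a "request" dict, and the
# callback finishes it. All names are hypothetical stand-ins.
def parse(base_item):
    # like: yield Request(url, meta={'il': il}, callback=self.parse_subpage)
    return {"url": base_item["web_url"], "meta": {"il": base_item}}

def parse_subpage(response):
    il = response["meta"]["il"]        # like: il = response.meta['il']
    il["loc_pj"] = "Sacramento, CA"    # field scraped from the subpage
    return il                          # like: yield il.load_item()

req = parse({"title": "Pathologist", "web_url": "http://example.com/job/1"})
item = parse_subpage({"meta": req["meta"]})
print(item["loc_pj"])  # Sacramento, CA
```

The key point is that the same mutable object travels from the first callback to the second, so fields added in either callback end up on one item.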
You initialize the ItemLoader like this:
il = CAPjobsItemLoader(CAPjobsItem, sites)
In the documentation it is done like this:
l = ItemLoader(item=Product(), response=response)
So I think you are missing the parentheses on CAPjobsItem, and your line should be:
il = CAPjobsItemLoader(CAPjobsItem(), sites)
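The difference the parentheses make can be checked without Scrapy: the loader expects an item *instance*, while the bare class object has no fields to fill. A minimal sketch, with `Product` as a stand-in for `CAPjobsItem`:

```python
# Why CAPjobsItem vs CAPjobsItem() matters: the bare name is the class
# object, the call creates an instance. Product stands in for CAPjobsItem.
class Product:
    pass

without_parens = Product    # the class itself -- what the broken line passed
with_parens = Product()     # a fresh instance -- what the loader expects

print(isinstance(with_parens, Product))     # True
print(isinstance(without_parens, Product))  # False: the class is not an item
```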