Create a static column item
I have a simple spider that scrapes local obituaries. The code works fine until I try to add two static columns. All I want to do is add the date I pulled the information (the pull item) and the state it was pulled from (the state item). It is a self-loading page, so when I add the pull date I only get the first 10 results (or only the first page). If I add only the state, I get just two results. When I remove both, I get all 40+ results.
The lines I have commented out with # are the ones that do not work:
The Item.py file:
import scrapy

class AlItem(scrapy.Item):
    name = scrapy.Field()
    link = scrapy.Field()
    obit = scrapy.Field()
    news = scrapy.Field()
    #pull = scrapy.Field()
    #state = scrapy.Field()
The spider file:
import scrapy
import time
from al.items import AlItem

class AlabamaSpider(scrapy.Spider):
    name = 'alabama'
    allowed_domains = ['legacy.com']
    start_urls = ['http://www.legacy.com/obituaries/annistonstar/browse?type=paid&page=20']

    def parse(self, response):
        name = response.xpath('//a[@class="NonMobile"]/p[@class="obitName"]/text()').extract()
        link = response.xpath('//div[@class="RightColumn"]//a[@class="ObituaryButton"]/@href').extract()
        obit = response.xpath('//div[@class="NameAndLocation"]/p[@class="obitText"]/text()').extract()
        news = response.xpath('//div[@class="PublishedLine publishedLine"]/span/text()').extract()
        #pull = time.strftime("%m/%d/%Y")
        #state = "AL"

        for item in zip(name, link, obit, news):  # removed 'pull, state'
            new_item = AlItem()
            new_item['name'] = item[0]
            new_item['link'] = item[1]
            new_item['obit'] = item[2]
            new_item['news'] = item[3]
            #new_item['pull'] = pull
            #new_item["state"] = state
            yield new_item
Here is why:
If you put pull and state into `for item in zip(name, link, obit, news):`, you get only 2 iterations, because `state = "AL"` is a string variable. `zip()` iterates over a string character by character, and it always stops at its shortest argument, so the two characters of `state` cap the loop at 2 iterations for every argument. The same thing happens with the date: `01/01/2001` is 10 characters, so the loop is capped at 10 iterations (which looks like "only the first page" of results).
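You can see this truncation behavior with plain lists in a few lines (the sample values here are made up for illustration):

```python
# zip() stops at the shortest argument. A string passed to zip() is
# iterated character by character, so it silently caps the loop length.
names = ["n1", "n2", "n3", "n4"]
links = ["l1", "l2", "l3", "l4"]

state = "AL"  # 2 characters -> only 2 iterations
print(len(list(zip(names, links, state))))  # 2

pull = "01/01/2001"  # 10 characters -> would allow up to 10 iterations
print(len(list(zip(names, links, pull))))   # 4 (names/links run out first)
```

That is exactly why the spider yielded 2 items with `state` in the zip and stopped at 10 with `pull` in it.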
This will work:
class AlItem(scrapy.Item):
    name = scrapy.Field()
    link = scrapy.Field()
    obit = scrapy.Field()
    news = scrapy.Field()
    pull = scrapy.Field()
    state = scrapy.Field()
import scrapy
import time
from al.items import AlItem

class AlabamaSpider(scrapy.Spider):
    name = 'alabama'
    allowed_domains = ['legacy.com']
    start_urls = ['http://www.legacy.com/obituaries/annistonstar/browse?type=paid&page=20']

    def parse(self, response):
        name = response.xpath('//a[@class="NonMobile"]/p[@class="obitName"]/text()').extract()
        link = response.xpath('//div[@class="RightColumn"]//a[@class="ObituaryButton"]/@href').extract()
        obit = response.xpath('//div[@class="NameAndLocation"]/p[@class="obitText"]/text()').extract()
        news = response.xpath('//div[@class="PublishedLine publishedLine"]/span/text()').extract()
        pull = time.strftime("%m/%d/%Y")
        state = "AL"

        for item in zip(name, link, obit, news):  # pull and state stay out of zip()
            new_item = AlItem()
            new_item['name'] = item[0]
            new_item['link'] = item[1]
            new_item['obit'] = item[2]
            new_item['news'] = item[3]
            new_item['pull'] = pull
            new_item["state"] = state
            yield new_item