Scrapy Python loop to next unscraped link
I'm trying to get my spider to go through a list, scrape every URL it can find, scrape some data from each one, then return and continue to the next unscraped link. When I run the spider I can see that it returns to the start page, but it tries to scrape the same page again and then just exits. I'm very new to Python, so any code suggestions are welcome.
import scrapy
import re
from production.items import ProductionItem, ListResidentialItem

class productionSpider(scrapy.Spider):
    name = "production"
    allowed_domains = ["domain.com"]
    start_urls = [
        "http://domain.com/list"
    ]

    def parse(self, response):
        for sel in response.xpath('//html/body'):
            item = ProductionItem()
            item['listurl'] = sel.xpath('//a[@id="link101"]/@href').extract()[0]
            request = scrapy.Request(item['listurl'], callback=self.parseBasicListingInfo)
            yield request

    def parseBasicListingInfo(item, response):
        item = ListResidentialItem()
        item['title'] = response.xpath('//span[@class="detail"]/text()').extract()
        return item
Clarification:
I'm passing [0] so it only takes the first link in the list, but I want it to continue on to the next unscraped link.
Output after running the spider:
2016-07-18 12:11:20 [scrapy] DEBUG: Crawled (200) <GET http://www.domain.com/robots.txt> (referer: None)
2016-07-18 12:11:20 [scrapy] DEBUG: Crawled (200) <GET http://www.domain.com/list> (referer: None)
2016-07-18 12:11:21 [scrapy] DEBUG: Crawled (200) <GET http://www.domain.com/link1> (referer: http://www.domain.com/list)
2016-07-18 12:11:21 [scrapy] DEBUG: Scraped from <200 http://www.domain.com/link1>
{'title': [u'\rlink1\r']}
This should work fine. Change the domain and the xpaths to match your site and see:
import scrapy
import re
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ProdItems(scrapy.Item):
    listurl = scrapy.Field()
    title = scrapy.Field()

class productionSpider(scrapy.Spider):
    name = "production"
    allowed_domains = ["domain.com"]
    start_urls = [
        "http://domain.com/list"
    ]

    def parse(self, response):
        for sel in response.xpath('//html/body'):
            list_urls = sel.xpath('//a[@id="link101"]/@href').extract()
            for url in list_urls:
                # Use a fresh item per request so concurrent callbacks
                # don't all overwrite the same instance's fields.
                item = ProdItems()
                item['listurl'] = url
                yield scrapy.Request(url, callback=self.parseBasicListingInfo, meta={'item': item})

    def parseBasicListingInfo(self, response):
        # The partially-filled item travels with the request in its meta dict.
        item = response.request.meta['item']
        item['title'] = response.xpath('//span[@class="detail"]/text()').extract()
        yield item
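As a side note, the output in the question ({'title': [u'\rlink1\r']}) shows stray carriage returns in the scraped title. A minimal, hypothetical cleanup (not part of the answer above) is to strip whitespace from each extracted string before storing it:

from scrapy.selector import Selector

# Invented sample markup reproducing the stray \r seen in the question's output.
html = '<span class="detail">\rlink1\r</span>'
titles = Selector(text=html).xpath('//span[@class="detail"]/text()').extract()
print([t.strip() for t in titles])  # ['link1']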
This is the line that's causing your problem:
item['listurl'] = sel.xpath('//a[@id="link101"]/@href').extract()[0]
"//" means "from the start of the document", so it scans from the first tag and will always find the same first link. What you need to do is search with ".//", which means "from this tag onwards" and is relative to the current tag. Also, your current for loop is visiting every tag in the document unnecessarily. Try this:
def parse(self, response):
    for href in response.xpath('//a[@id="link101"]/@href').extract():
        item = ProductionItem()
        item['listurl'] = href
        yield scrapy.Request(href, callback=self.parseBasicListingInfo, meta={'item': item})
The xpath pulls the hrefs out of the links and returns them as a list that you can iterate over.
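To see the "//" versus ".//" difference in isolation, here is a minimal standalone sketch using Scrapy's Selector; the markup and URLs are invented for illustration:

from scrapy.selector import Selector

html = """
<body>
  <div><a href="/first">first</a></div>
  <div><a href="/second">second</a></div>
</body>
"""
sel = Selector(text=html)

for div in sel.xpath('//div'):
    # '//a' restarts the search at the document root, so every
    # iteration matches the same first link: /first, /first
    print(div.xpath('//a/@href').extract_first())
    # './/a' searches relative to the current <div>, so each
    # iteration matches that div's own link: /first, /second
    print(div.xpath('.//a/@href').extract_first())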