Scrapy: crawling extracted links
I need to crawl a website and follow every URL found at a specific XPath on that site.
For example:
I need to crawl "http://someurl.com/world/", which has 10 links in the container (xpath("//div[@class='pane-content']")). I need to crawl all 10 of those links and extract images from them. The links in "http://someurl.com/world/" look like this:
"http://someurl.com/node/xxxx"
What I have so far:
import scrapy
from scrapy.contrib.spiders import Rule, CrawlSpider
from scrapy.contrib.linkextractors import LinkExtractor
from imgur.items import ImgurItem


class ImgurSpider(CrawlSpider):
    name = 'imgur'
    allowed_domains = ['someurl.com/']
    start_urls = ['http://someurl.com/news']
    rules = [Rule(LinkExtractor(allow=('/node/.*')), callback='parse_imgur', follow=True)]

    def parse_imgur(self, response):
        image = ImgurItem()
        image['title'] = response.xpath(\
            "//h1[@class='pane-content']/a/text()").extract()
        rel = response.xpath("//img/@src").extract()
        image['image_urls'] = response.xpath("//img/@src").extract()
        return image
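You can rewrite your Rule to cover all of these requirements. Passing restrict_xpaths to the LinkExtractor limits link extraction to the container you care about: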
rules = [Rule(LinkExtractor(allow=('/node/.*',), restrict_xpaths=('//div[@class="pane-content"]',)), callback='parse_imgur', follow=True)]
To download the images from the extracted image links, you can use Scrapy's bundled ImagesPipeline.