Scrapy crawl extracted links
I need to crawl a website and follow every URL on that site that appears at a specific XPath.
For example: I need to crawl "" which has 10 links in the container (xpath("//div[@class='pane-content']")), and I need to crawl all 10 of those links and extract images from them, but the links in "" look like "".
import scrapy
from scrapy.contrib.spiders import Rule, CrawlSpider
from scrapy.contrib.linkextractors import LinkExtractor
from imgur.items import ImgurItem


class ImgurSpider(CrawlSpider):
    name = 'imgur'
    allowed_domains = ['']
    start_urls = ['']
    rules = [Rule(LinkExtractor(allow=('/node/.*')), callback='parse_imgur', follow=True)]

    def parse_imgur(self, response):
        image = ImgurItem()
        image['title'] = response.xpath(\
        rel = response.xpath("//img/@src").extract()
        image['image_urls'] = response.xpath("//img/@src").extract()
        return image
You can rewrite your 'Rule' so that it covers all of your requirements:
rules = [Rule(LinkExtractor(allow=('/node/.*',), restrict_xpaths=('//div[@class="pane-content"]',)), callback='parse_imgur', follow=True)]
To download the images from the extracted image links, you can use Scrapy's bundled ImagesPipeline.
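Enabling the pipeline is mostly a settings change. The fragment below is a minimal sketch for the project's settings.py, assuming a recent Scrapy release (where the class lives at scrapy.pipelines.images.ImagesPipeline) and that Pillow is installed; the storage path is a placeholder.

```python
# settings.py (sketch)

# Turn on the bundled images pipeline; the number is its order among pipelines.
ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 1,
}

# Directory where downloaded images are stored (hypothetical path).
IMAGES_STORE = "/path/to/images"
```

By default the pipeline reads URLs from the item's image_urls field and writes download results to an images field, so ImgurItem would need both fields declared. The URLs in image_urls must be absolute, so relative src values would have to be joined with the page URL first, for example with response.urljoin.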