Stuck with scrapy following imgur links from subreddits
I'm scraping reddit to get the link of every entry in a subreddit. I also want to follow the links that match http://imgur.com/gallery/\w*, but I'm having trouble getting the Imgur callback to run. It simply never executes. What is failing?

I'm detecting the Imgur URLs with a plain if "http://imgur.com/gallery/" in item['link'][0]: check; maybe Scrapy provides a better way to detect them?

Here is what I tried:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from reddit.items import RedditItem

class RedditSpider(CrawlSpider):
    name = "reddit"
    allowed_domains = ["reddit.com"]
    start_urls = [
        "http://www.reddit.com/r/pics",
    ]

    rules = [
        Rule(
            LinkExtractor(allow=['/r/pics/\?count=\d.*&after=\w.*']),
            callback='parse_item',
            follow=True
        )
    ]

    def parse_item(self, response):
        for title in response.xpath("//div[contains(@class, 'entry')]/p/a"):
            item = RedditItem()
            item['title'] = title.xpath('text()').extract()
            item['link'] = title.xpath('@href').extract()
            yield item

            if "http://imgur.com/gallery/" in item['link'][0]:
                # print item['link'][0]
                url = response.urljoin(item['link'][0])
                print url
                yield scrapy.Request(url, callback=self.parse_imgur_gallery)

    def parse_imgur_gallery(self, response):
        print response.url
Here is my item class:
import scrapy

class RedditItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
Here is the output when running the crawler with --nolog. The URLs below come from the print url line inside the if condition (not from response.url in parse_imgur_gallery), so the links are being found, but the callback still never runs:
PS C:\repos\python\scrapy\reddit> scrapy crawl --output=export.json --nolog reddit
http://imgur.com/gallery/W7sXs/new
http://imgur.com/gallery/v26KnSX
http://imgur.com/gallery/fqqBq
http://imgur.com/gallery/9GDTP/new
http://imgur.com/gallery/5gjLCPV
http://imgur.com/gallery/l6Tpavl
http://imgur.com/gallery/Ow4gQ
...
Found it. The imgur.com domain was not allowed, so Scrapy was filtering out the requests. I just had to add it:

allowed_domains = ["reddit.com", "imgur.com"]
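As for the side question about a cleaner way to detect the Imgur links: once imgur.com is allowed, a second Rule with its own LinkExtractor can pick up the gallery URLs directly, instead of the manual string check inside parse_item. A minimal sketch of that variant (untested, keeping the same item extraction as above):

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from reddit.items import RedditItem

class RedditSpider(CrawlSpider):
    name = "reddit"
    # Both domains must be listed here; otherwise the offsite middleware
    # silently drops the imgur requests and the callback never fires.
    allowed_domains = ["reddit.com", "imgur.com"]
    start_urls = ["http://www.reddit.com/r/pics"]

    rules = [
        # Pagination through the subreddit listing, as before.
        Rule(
            LinkExtractor(allow=[r'/r/pics/\?count=\d.*&after=\w.*']),
            callback='parse_item',
            follow=True,
        ),
        # Extract the imgur gallery links straight from the listing pages.
        Rule(
            LinkExtractor(allow=[r'imgur\.com/gallery/\w+']),
            callback='parse_imgur_gallery',
        ),
    ]

    def parse_item(self, response):
        # Same item extraction as before, minus the manual imgur check.
        for title in response.xpath("//div[contains(@class, 'entry')]/p/a"):
            item = RedditItem()
            item['title'] = title.xpath('text()').extract()
            item['link'] = title.xpath('@href').extract()
            yield item

    def parse_imgur_gallery(self, response):
        self.logger.info(response.url)

With this layout the if check and the extra scrapy.Request go away, but the offsite filtering still applies either way, which is why adding imgur.com to allowed_domains is the essential part of the fix.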