Stuck with scrapy following imgur links from subreddits
I'm scraping reddit to get the link of every entry in a subreddit. I also want to follow the links that match http://imgur.com/gallery/\w*, but I'm having trouble getting the Imgur callback to run. It simply never executes. What is failing?

I'm detecting the Imgur URLs with a plain if "http://imgur.com/gallery/" in item['link'][0]: check; maybe Scrapy provides a better way to detect them?

Here is what I tried:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from reddit.items import RedditItem

class RedditSpider(CrawlSpider):
    name = "reddit"
    allowed_domains = ["reddit.com"]
    start_urls = [
        "http://www.reddit.com/r/pics",
    ]

    rules = [
        Rule(
            LinkExtractor(allow=['/r/pics/\?count=\d.*&after=\w.*']),
            callback='parse_item',
            follow=True
        )
    ]

    def parse_item(self, response):
        for title in response.xpath("//div[contains(@class, 'entry')]/p/a"):
            item = RedditItem()
            item['title'] = title.xpath('text()').extract()
            item['link'] = title.xpath('@href').extract()
            yield item

            if "http://imgur.com/gallery/" in item['link'][0]:
                # print item['link'][0]
                url = response.urljoin(item['link'][0])
                print url
                yield scrapy.Request(url, callback=self.parse_imgur_gallery)

    def parse_imgur_gallery(self, response):
        print response.url
Here is my item class:
import scrapy

class RedditItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
Here is the output when running the crawler with --nolog. The URLs below come from the print url line inside the if condition (not from response.url in parse_imgur_gallery), so the links are being found, but the callback still never runs:
PS C:\repos\python\scrapy\reddit> scrapy crawl --output=export.json --nolog reddit
http://imgur.com/gallery/W7sXs/new
http://imgur.com/gallery/v26KnSX
http://imgur.com/gallery/fqqBq
http://imgur.com/gallery/9GDTP/new
http://imgur.com/gallery/5gjLCPV
http://imgur.com/gallery/l6Tpavl
http://imgur.com/gallery/Ow4gQ
...
Found it. The imgur.com domain was not allowed, so Scrapy was filtering out the requests. I just had to add it:

allowed_domains = ["reddit.com", "imgur.com"]
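As for the side question about a cleaner way to detect the Imgur links: once imgur.com is allowed, a second Rule with its own LinkExtractor can pick up the gallery URLs directly, instead of the manual string check inside parse_item. A minimal sketch of that variant (untested, keeping the same item extraction as above):

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from reddit.items import RedditItem

class RedditSpider(CrawlSpider):
    name = "reddit"
    # Both domains must be listed here; otherwise the offsite middleware
    # silently drops the imgur requests and the callback never fires.
    allowed_domains = ["reddit.com", "imgur.com"]
    start_urls = ["http://www.reddit.com/r/pics"]

    rules = [
        # Pagination through the subreddit listing, as before.
        Rule(
            LinkExtractor(allow=[r'/r/pics/\?count=\d.*&after=\w.*']),
            callback='parse_item',
            follow=True,
        ),
        # Extract the imgur gallery links straight from the listing pages.
        Rule(
            LinkExtractor(allow=[r'imgur\.com/gallery/\w+']),
            callback='parse_imgur_gallery',
        ),
    ]

    def parse_item(self, response):
        # Same item extraction as before, minus the manual imgur check.
        for title in response.xpath("//div[contains(@class, 'entry')]/p/a"):
            item = RedditItem()
            item['title'] = title.xpath('text()').extract()
            item['link'] = title.xpath('@href').extract()
            yield item

    def parse_imgur_gallery(self, response):
        self.logger.info(response.url)

With this layout the if check and the extra scrapy.Request go away, but the offsite filtering still applies either way, which is why adding imgur.com to allowed_domains is the essential part of the fix.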