Scrapy crawler not processing XHR request
My spider only scrapes the first 10 pages, so I assume it is not reaching the "Load more" button via the request.
I am scraping this website: http://www.t3.com/reviews.
My crawler code:
import scrapy
from scrapy.conf import settings
from scrapy.http import Request
from scrapy.selector import Selector
from reviews.items import ReviewItem

class T3Spider(scrapy.Spider):
    name = "t3"  # spider name to call in terminal
    allowed_domains = ['t3.com']  # the domain where the spider is allowed to crawl
    start_urls = ['http://www.t3.com/reviews']  # url from which the spider will start crawling

    def parse(self, response):
        sel = Selector(response)
        review_links = sel.xpath('//div[@id="content"]//div/div/a/@href').extract()
        for link in review_links:
            yield Request(url="http://www.t3.com" + link, callback=self.parse_review)
        # if there is a load-more button:
        if sel.xpath('//*[@class="load-more"]'):
            req = Request(url=r'http://www\.t3\.com/more/reviews/latest/\d+',
                          headers={"Referer": "http://www.t3.com/reviews",
                                   "X-Requested-With": "XMLHttpRequest"},
                          callback=self.parse)
            yield req
        else:
            return

    def parse_review(self, response):
        pass  # all my scraped item fields
What am I doing wrong? Sorry, I am still new to Scrapy. Thank you for your time, patience and help.
If you inspect the "Load More" button, you will not find any indication of how the link for loading more reviews is constructed. The idea behind it is fairly simple: the number after http://www.t3.com/more/reviews/latest/ looks like the timestamp of the last loaded article. Here is how to get it:
import calendar

from dateutil.parser import parse

import scrapy
from scrapy.http import Request

class T3Spider(scrapy.Spider):
    name = "t3"
    allowed_domains = ['t3.com']
    start_urls = ['http://www.t3.com/reviews']

    def parse(self, response):
        reviews = response.css('div.listingResult')
        for review in reviews:
            link = review.xpath("a/@href").extract()[0]
            yield Request(url="http://www.t3.com" + link, callback=self.parse_review)

        # TODO: handle exceptions here
        # extract the review date
        time = reviews[-1].xpath(".//time/@datetime").extract()[0]

        # convert a date into a timestamp
        timestamp = calendar.timegm(parse(time).timetuple())

        url = 'http://www.t3.com/more/reviews/latest/%d' % timestamp
        req = Request(url=url,
                      headers={"Referer": "http://www.t3.com/reviews",
                               "X-Requested-With": "XMLHttpRequest"},
                      callback=self.parse)
        yield req

    def parse_review(self, response):
        print(response.url)
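The date-to-timestamp conversion used above can be checked in isolation. A minimal stdlib-only sketch of the same idea (dateutil.parser.parse additionally handles timezone suffixes and many more formats; the datetime string here is a made-up example value, not one taken from the site):

```python
import calendar
from datetime import datetime

# parse an ISO-8601 datetime string (assumed to be UTC) into a datetime object
dt = datetime.strptime("2015-06-15T12:00:00", "%Y-%m-%dT%H:%M:%S")

# interpret the time tuple as UTC and produce a Unix timestamp
timestamp = calendar.timegm(dt.timetuple())

print('http://www.t3.com/more/reviews/latest/%d' % timestamp)
# → http://www.t3.com/more/reviews/latest/1434369600
```

Note that calendar.timegm() treats the tuple as UTC, unlike time.mktime(), which would apply the local timezone and shift the timestamp.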
Notes:
- this requires the dateutil module to be installed
- you should re-check the code and make sure you get all of the reviews without skipping any
- you should figure out how to terminate the "Load more" loop at some point
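Regarding the last note, the termination idea can be sketched independently of Scrapy: keep requesting the next page until one comes back empty. A minimal stdlib-only sketch with a stubbed fetch function (fetch_page, the page data, and the timestamp keys are all made up for illustration; in the spider this would be the XHR response containing no div.listingResult elements):

```python
def crawl_all(fetch_page, first_key):
    """Collect items page by page until an empty page signals the end."""
    items = []
    key = first_key
    while True:
        page = fetch_page(key)
        if not page:            # empty response: the "Load more" chain is exhausted
            break
        items.extend(entry for entry, _ in page)
        key = page[-1][1]       # next request is keyed by the last item's timestamp
    return items

# stubbed pages keyed by timestamp: lists of (item, item_timestamp) pairs
pages = {
    0:  [("review-a", 100), ("review-b", 90)],
    90: [("review-c", 80)],
    80: [],                     # the endpoint eventually returns nothing
}
print(crawl_all(pages.get, 0))
# → ['review-a', 'review-b', 'review-c']
```

In the spider itself, the equivalent check is `if not reviews: return` at the top of parse(), so that no further XHR request is yielded once a page has no results.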