IMDB web crawler - Scrapy - Python
import scrapy
from imdbscrape.items import MovieItem


class MovieSpider(scrapy.Spider):
    name = 'movie'
    allowed_domains = ['imdb.com']
    start_urls = ['https://www.imdb.com/search/title?year=2017,2018&title_type=feature&sort=moviemeter,asc']

    def parse(self, response):
        urls = response.css('h3.lister-item-header > a::attr(href)').extract()
        for url in urls:
            yield scrapy.Request(url=response.urljoin(url), callback=self.parse_movie)
        nextpg = response.css('div.desc > a::attr(href)').extract_first()
        if nextpg:
            nextpg = response.urljoin(nextpg)
            yield scrapy.Request(url=nextpg, callback=self.parse)

    def parse_movie(self, response):
        item = MovieItem()
        item['title'] = self.getTitle(response)
        item['year'] = self.getYear(response)
        item['rating'] = self.getRating(response)
        item['genre'] = self.getGenre(response)
        item['director'] = self.getDirector(response)
        item['summary'] = self.getSummary(response)
        item['actors'] = self.getActors(response)
        yield item
I wrote the code above to scrape all IMDb movies from 2017 to the present, but it only scrapes 100 movies. Please help.
I think the problem is with
nextpg = response.css('div.desc > a::attr(href)').extract_first()
On this page
https://www.imdb.com/search/title?year=2017,2018&title_type=feature&sort=moviemeter,asc
the HTML for the next-page link is this:
<div class="desc">
<span class="lister-current-first-item">1</span> to
<span class="lister-current-last-item">50</span> of 24,842 titles
<span class="ghost">|</span>
<a href="?year=2017,2018&title_type=feature&sort=moviemeter,asc&page=2&ref_=adv_nxt" class="lister-page-next next-page" ref-marker="adv_nxt">Next »</a>
</div>
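On the first page, your code gets the href of the link with the anchor text Next », which is the page=2 link shown above. You go to that page and scrape the next 50 movies.
On that second page, however, the div with class desc contains two links, unlike the first page.
The first link is the Previous link, not the Next link.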
<div class="desc">
<span class="lister-current-first-item">51</span> to
<span class="lister-current-last-item">100</span> of 24,842 titles
<span class="ghost">|</span> <a href="?year=2017,2018&title_type=feature&sort=moviemeter,asc&page=1&ref_=adv_prv" class="lister-page-prev prev-page" ref-marker="adv_nxt">« Previous</a>
<span class="ghost">|</span> <a href="?year=2017,2018&title_type=feature&sort=moviemeter,asc&page=3&ref_=adv_nxt" class="lister-page-next next-page" ref-marker="adv_nxt">Next »</a>
</div>
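To make the difference between the two pages concrete, here is a minimal sketch that feeds the two pagination snippets above through Python's standard-library `html.parser` and prints the first link found inside `div.desc`. It is only a stand-in for what `extract_first()` would pick up in the spider; the names `DescLinkParser` and `desc_links` are made up for this illustration and are not part of Scrapy.

```python
from html.parser import HTMLParser


class DescLinkParser(HTMLParser):
    """Collect (class, href) for every <a> found inside <div class="desc">."""

    def __init__(self):
        super().__init__()
        self.in_desc = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "desc":
            self.in_desc = True
        elif tag == "a" and self.in_desc:
            self.links.append((attrs.get("class", ""), attrs.get("href")))

    def handle_endtag(self, tag):
        # The sample snippets have no nested divs, so any </div> closes div.desc.
        if tag == "div":
            self.in_desc = False


def desc_links(html):
    """Hypothetical helper: return the div.desc links in document order."""
    parser = DescLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


# Pagination block copied from the first results page.
PAGE_1 = '''<div class="desc">
<span class="lister-current-first-item">1</span> to
<span class="lister-current-last-item">50</span> of 24,842 titles
<span class="ghost">|</span>
<a href="?year=2017,2018&title_type=feature&sort=moviemeter,asc&page=2&ref_=adv_nxt" class="lister-page-next next-page" ref-marker="adv_nxt">Next »</a>
</div>'''

# Pagination block copied from the second results page.
PAGE_2 = '''<div class="desc">
<span class="lister-current-first-item">51</span> to
<span class="lister-current-last-item">100</span> of 24,842 titles
<span class="ghost">|</span> <a href="?year=2017,2018&title_type=feature&sort=moviemeter,asc&page=1&ref_=adv_prv" class="lister-page-prev prev-page" ref-marker="adv_nxt">« Previous</a>
<span class="ghost">|</span> <a href="?year=2017,2018&title_type=feature&sort=moviemeter,asc&page=3&ref_=adv_nxt" class="lister-page-next next-page" ref-marker="adv_nxt">Next »</a>
</div>'''

first_on_page_1 = desc_links(PAGE_1)[0]
first_on_page_2 = desc_links(PAGE_2)[0]

print(first_on_page_1[0])  # class of the first link on page 1 (the Next link)
print(first_on_page_2[0])  # class of the first link on page 2 (the Previous link)
```

On page 2 the first link points back to page=1. If the spider follows it, the only "next" link it then finds is the page=2 URL it has already requested, and Scrapy's default duplicate filter drops requests for URLs it has already seen. That would be consistent with the crawl stopping at 100 movies.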
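What I would do is set a counter to 0.
Increment it after each successful scrape.
If the counter is greater than 0, take the second link, go to that link, and scrape the results on that page.
If the counter is not greater than 0, take the first link, go to it, and scrape the results on that page.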
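The counter idea could be sketched roughly as follows. This is not a drop-in spider, just the link-picking logic in plain Python so it can be run on its own. `pick_next_href` and `pages_scraped` are hypothetical names, `urllib.parse.urljoin` stands in for `response.urljoin`, and the sketch assumes the last results page has only a Previous link.

```python
from urllib.parse import urljoin

START_URL = ("https://www.imdb.com/search/title"
             "?year=2017,2018&title_type=feature&sort=moviemeter,asc")


def pick_next_href(hrefs, pages_scraped):
    """Return the href to follow next, or None if there is nothing to follow.

    hrefs: all hrefs found in div.desc, in document order.
    pages_scraped: counter of listing pages scraped before this one.
    """
    # On the first page the only link is Next; on later pages
    # Previous comes first, so Next is the second link.
    index = 1 if pages_scraped > 0 else 0
    if index < len(hrefs):
        return hrefs[index]
    # Assumed last page: only a Previous link, so stop paginating.
    return None


# hrefs as they appear in the two pagination snippets above.
page_1_hrefs = ["?year=2017,2018&title_type=feature&sort=moviemeter,asc&page=2&ref_=adv_nxt"]
page_2_hrefs = [
    "?year=2017,2018&title_type=feature&sort=moviemeter,asc&page=1&ref_=adv_prv",
    "?year=2017,2018&title_type=feature&sort=moviemeter,asc&page=3&ref_=adv_nxt",
]

pages_scraped = 0
after_page_1 = urljoin(START_URL, pick_next_href(page_1_hrefs, pages_scraped))
pages_scraped += 1  # page 1 was scraped successfully
after_page_2 = urljoin(after_page_1, pick_next_href(page_2_hrefs, pages_scraped))

print(after_page_1)
print(after_page_2)
```

In the spider, `hrefs` would come from `response.css('div.desc > a::attr(href)').extract()` and the counter could be kept as an attribute on the spider instance. Since the Next link carries the class `lister-page-next` in the HTML above, selecting on that class, for example `div.desc > a.lister-page-next::attr(href)`, may be a simpler alternative that needs no counter.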