IMDB scrapy: get all movie data
I'm working on a class project and trying to get all IMDB movie data (titles, budgets, etc.) up to the year 2016. I adapted the code from https://github.com/alexwhb/IMDB-spider/blob/master/tutorial/spiders/spider.py.
My idea was: for i in range(1874, 2016) (since 1874 is the earliest year shown on http://www.imdb.com/year/), point the program at the website for the corresponding year and scrape from that URL.
But the problem is that each page only shows 50 movies per year. After crawling those 50 movies, how do I move on to the next page? And after crawling each year, how do I move on to the next year? Here is the URL-parsing part of my code so far, but it only scrapes 50 movies for a given year.
class tutorialSpider(scrapy.Spider):
    name = "tutorial"
    allowed_domains = ["imdb.com"]
    start_urls = ["http://www.imdb.com/search/title?year=2014,2014&title_type=feature&sort=moviemeter,asc"]

    def parse(self, response):
        for sel in response.xpath("//*[@class='results']/tr/td[3]"):
            item = MovieItem()
            item['Title'] = sel.xpath('a/text()').extract()[0]
            item['MianPageUrl'] = "http://imdb.com" + sel.xpath('a/@href').extract()[0]
            request = scrapy.Request(item['MianPageUrl'], callback=self.parseMovieDetails)
            request.meta['item'] = item
            yield request
I came up with a very clumsy way to solve this: I put all of the links into start_urls. A better solution would be much appreciated!
class tutorialSpider(scrapy.Spider):
    name = "tutorial"
    allowed_domains = ["imdb.com"]
    start_urls = []

    for i in xrange(1874, 2017):
        for j in xrange(1, 11501, 50):
            # since the largest number of movies in any year is 11,400 (2016)
            start_url = "http://www.imdb.com/search/title?sort=moviemeter,asc&start=" + str(j) + "&title_type=feature&year=" + str(i) + "," + str(i)
            start_urls.append(start_url)

    def parse(self, response):
        for sel in response.xpath("//*[@class='results']/tr/td[3]"):
            item = MovieItem()
            item['Title'] = sel.xpath('a/text()').extract()[0]
            item['MianPageUrl'] = "http://imdb.com" + sel.xpath('a/@href').extract()[0]
            request = scrapy.Request(item['MianPageUrl'], callback=self.parseMovieDetails)
            request.meta['item'] = item
            yield request
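For what it's worth, the same URL list can be written more compactly as a list comprehension. This is purely cosmetic and keeps the same hard-coded bounds:

    start_urls = [
        "http://www.imdb.com/search/title?sort=moviemeter,asc&start=%d&title_type=feature&year=%d,%d" % (j, i, i)
        for i in xrange(1874, 2017)    # 1874 is the earliest year listed on imdb.com/year/
        for j in xrange(1, 11501, 50)  # result pages are offset in steps of 50
    ]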
You can use a CrawlSpider to simplify your task. As you will see below, start_requests dynamically generates the list of URLs, while parse_page only extracts the movies to crawl. Finding and following the 'Next' link is handled by the rules attribute.
I agree with @Padraic Cunningham that hard-coding values is not a good idea, so I added spider arguments so that you can call:
scrapy crawl imdb -a start=1950 -a end=1980
(the scraper defaults to 1874-2016 if it does not get any arguments).
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from imdbyear.items import MovieItem

class IMDBSpider(CrawlSpider):
    name = 'imdb'
    rules = (
        # extract links at the bottom of the page. note that there are 'Prev' and 'Next'
        # links, so a bit of additional filtering is needed
        Rule(LinkExtractor(restrict_xpaths=('//*[@id="right"]/span/a')),
             process_links=lambda links: filter(lambda l: 'Next' in l.text, links),
             callback='parse_page',
             follow=True),
    )

    def __init__(self, start=None, end=None, *args, **kwargs):
        super(IMDBSpider, self).__init__(*args, **kwargs)
        self.start_year = int(start) if start else 1874
        self.end_year = int(end) if end else 2016

    # generate start_urls dynamically
    def start_requests(self):
        for year in range(self.start_year, self.end_year + 1):
            yield scrapy.Request('http://www.imdb.com/search/title?year=%d,%d&title_type=feature&sort=moviemeter,asc' % (year, year))

    def parse_page(self, response):
        for sel in response.xpath("//*[@class='results']/tr/td[3]"):
            item = MovieItem()
            item['Title'] = sel.xpath('a/text()').extract()[0]
            # note -- you had 'MianPageUrl' as your scrapy field name. I would recommend fixing
            # this typo (you will need to change it in items.py as well)
            item['MainPageUrl'] = "http://imdb.com" + sel.xpath('a/@href').extract()[0]
            request = scrapy.Request(item['MainPageUrl'], callback=self.parseMovieDetails)
            request.meta['item'] = item
            yield request

    # make sure that the dynamically generated start_urls are parsed as well
    parse_start_url = parse_page

    # do your magic
    def parseMovieDetails(self, response):
        pass
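Since the question also asks for budgets, here is one possible sketch of parseMovieDetails. The 'Budget' field and the #titleDetails XPath are assumptions about your items.py and about the old IMDB title-page markup, so adapt them to the real pages:

    def parseMovieDetails(self, response):
        # recover the partially-filled item that parse_page passed along via meta
        item = response.meta['item']
        # assumed selector: on the old IMDB layout the budget appeared inside a
        # '#titleDetails' block as the text node following an <h4>Budget:</h4> label
        budget = response.xpath(
            "//*[@id='titleDetails']//h4[contains(text(), 'Budget')]/following-sibling::text()"
        ).extract_first()
        item['Budget'] = budget.strip() if budget else None
        yield item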
You can use the piece of code below to follow the next page:

    # 'a.lister-page-next.next-page::attr(href)' is the selector for the next-page link
    next_page = response.css('a.lister-page-next.next-page::attr(href)').extract_first()
    if next_page is not None:
        next_page = response.urljoin(next_page)  # joins the current and next page URLs
        yield scrapy.Request(next_page, callback=self.parse)  # calls parse again on the next page
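For context, here is a minimal sketch of where that snippet would sit, at the end of a spider's parse method (the item-extraction part is elided):

    def parse(self, response):
        # ... extract the items on the current result page here ...

        # then queue the next result page, if there is one
        next_page = response.css('a.lister-page-next.next-page::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)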
The code provided by @Greg Sadetsky needs a minor change. Well, really only one change, in the first line of the parse_page method.
Just change the XPath in the for loop from:
response.xpath("//*[@class='results']/tr/td[3]"):
to
response.xpath("//*[contains(@class,'lister-item-content')]/h3"):
This worked like a charm for me!
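For clarity, here is a sketch of what parse_page looks like with that one line swapped in. The rest is unchanged, since the <h3> in the newer 'lister' markup wraps the same <a> element:

    def parse_page(self, response):
        # matched against IMDB's newer 'lister' result markup
        for sel in response.xpath("//*[contains(@class,'lister-item-content')]/h3"):
            item = MovieItem()
            item['Title'] = sel.xpath('a/text()').extract()[0]
            item['MainPageUrl'] = "http://imdb.com" + sel.xpath('a/@href').extract()[0]
            request = scrapy.Request(item['MainPageUrl'], callback=self.parseMovieDetails)
            request.meta['item'] = item
            yield request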