Python Scrapy Web Scraping: problem with getting the URL inside an onclick element that has Ajax content
I am a beginner at web scraping with Scrapy. I am trying to scrape user reviews of specific books from goodreads.com. I want to scrape all of the reviews for each book, so I have to parse every review page. Below each review page there is a next_page button, but the link for that button is embedded in an onclick attribute, which is the problem. The onclick contains an Ajax request, and I don't know how to handle this situation. Thanks in advance for your help.
Picture of the next_page button
It's the content of the onclick button
It's the remaining part of the onclick button
I am also a beginner at posting on Stack Overflow, so please forgive any mistakes. :)
I am sharing my scraping code below.
Also, here is one of the example book links; the review section is at the bottom of the page.
import scrapy
from ..items import GoodreadsItem
from scrapy import Request
from urllib.parse import urljoin
from urllib.parse import urlparse


class CrawlnscrapeSpider(scrapy.Spider):
    name = 'crawlNscrape'
    allowed_domains = ['www.goodreads.com']
    start_urls = ['https://www.goodreads.com/list/show/702.Cozy_Mystery_Series_First_Book_of_a_Series']

    def parse(self, response):
        # collect all book links on this page, then request each one
        # with parse_page as the callback
        for href in response.css("a.bookTitle::attr(href)"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_page)

        # go to the next page of the list and call parse again
        next_page = response.xpath("(//a[@class='next_page'])[1]/@href")
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, self.parse)

    def parse_page(self, response):
        # create an empty GoodreadsItem for this book
        book = GoodreadsItem()
        title = response.css("#bookTitle::text").get()
        reviews = response.css(".readable span:nth-child(2)::text").getall()

        # store the title and all reviews on this page in the item
        book['title'] = title
        book['reviews'] = reviews

        # I want to extract all of the review pages for any book,
        # but there is an Ajax request in the onclick attribute,
        # so I can't scrape the link of the next page.
        next_page = response.xpath("(//a[@class='next_page'])[1]/@onclick")
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, callback=self.parse_page)

        yield book
Instead of the following code:
next_page = response.xpath("(//a[@class='next_page'])[1]/@onclick")
if next_page:
    url = response.urljoin(next_page[0].extract())
    yield scrapy.Request(url, callback=self.parse_page)
Try this:
First, add this import:
from re import search
Then use the following pagination logic:
next_page_html = response.xpath("//a[@class='next_page' and @href='#']/@onclick").get()
if next_page_html is not None:
    next_page_href = search(r"Request\(.([^\']+)", next_page_html)
    if next_page_href:
        url = response.urljoin(next_page_href.group(1))
        yield scrapy.Request(url, callback=self.parse_page)
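To see what the regex does in isolation, here is a minimal sketch. The onclick value below is a hypothetical example modeled on the Prototype-style `new Ajax.Request('…')` call shown in the screenshots; the exact string on Goodreads may differ. The pattern matches `Request(`, skips the opening quote, and captures everything up to the closing quote, i.e. the relative URL you can pass to `response.urljoin()`:

```python
from re import search

# Hypothetical onclick value for illustration; the real attribute on
# Goodreads may contain extra options, but the URL sits inside the
# first quoted argument of Ajax.Request(...) in the same way.
onclick = "new Ajax.Request('/book/reviews/12345?page=2', {asynchronous:true, evalScripts:true})"

match = search(r"Request\(.([^\']+)", onclick)
if match:
    print(match.group(1))  # prints: /book/reviews/12345?page=2
```

Note that `.` after `Request\(` consumes the opening quote character, and `[^\']+` stops at the closing quote, so `group(1)` is exactly the quoted URL.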