How can I handle pagination with Scrapy and Splash, if the href of the button is javascript:void(0)
I am trying to scrape the names and links of universities from this website: https://www.topuniversities.com/university-rankings/world-university-rankings/2021. I am stuck on the pagination, because the href of the button that leads to the next page is javascript:void(0), so I cannot reach the next page with scrapy.Request() or response.follow(). Is there a way to handle pagination like this?
[screenshot of the website]
[screenshot of the tag and its href]
The URL of this website has no parameters, and it stays the same when the next-page button is clicked, so I cannot handle the pagination by changing the URL.
The following code snippet only gets the names and links of the universities on the first and second pages:
import scrapy
from scrapy_splash import SplashRequest


class UniSpider(scrapy.Spider):
    name = 'uni'
    allowed_domains = ['www.topuniversities.com']

    script = """
    function main(splash, args)
      splash:set_user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36")
      splash.private_mode_enabled = false
      assert(splash:go(args.url))
      assert(splash:wait(3))
      return {
        html = splash:html()
      }
    end
    """

    next_page = """
    function main(splash, args)
      splash:set_user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36")
      splash.private_mode_enabled = false
      assert(splash:go(args.url))
      assert(splash:wait(3))
      local btn = assert(splash:jsfunc([[
        function(){
          document.querySelector("#alt-style-pagination a.page-link.next").click()
        }
      ]]))
      assert(splash:wait(2))
      btn()
      splash:set_viewport_full()
      assert(splash:wait(3))
      return {
        html = splash:html()
      }
    end
    """

    def start_requests(self):
        yield SplashRequest(
            url="https://www.topuniversities.com/university-rankings/world-university-rankings/2021",
            callback=self.parse, endpoint="execute",
            args={"lua_source": self.script})

    def parse(self, response):
        for uni in response.css("a.uni-link"):
            uni_link = response.urljoin(uni.css("::attr(href)").get())
            yield {
                "name": uni.css("::text").get(),
                "link": uni_link
            }

        yield SplashRequest(
            url=response.url,
            callback=self.parse, endpoint="execute",
            args={"lua_source": self.next_page}
        )
You don't need Splash for this simple website.
Try loading the following link:
https://www.topuniversities.com/sites/default/files/qs-rankings-data/en/2057712.txt
It contains all the universities; the website loads this file/JSON only once and then paginates the information on the client side.
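Since the question uses Scrapy, here is a minimal spider sketch built on the same idea. It assumes the 2057712.txt endpoint above (the numeric file name may differ for other ranking years) and that each 'title' field is an HTML fragment containing an anchor whose href is the relative university URL, as in the original spider; the spider name uni_json is just an example.

import scrapy
from lxml.html import fromstring


class UniJsonSpider(scrapy.Spider):
    # Hypothetical spider name, not from the original post.
    name = "uni_json"
    allowed_domains = ["www.topuniversities.com"]
    # The JSON endpoint that the rankings page loads once and then paginates client-side.
    start_urls = [
        "https://www.topuniversities.com/sites/default/files/qs-rankings-data/en/2057712.txt"
    ]

    def parse(self, response):
        for row in response.json()["data"]:
            # 'title' is assumed to hold an HTML fragment like '<a href="...">MIT</a>'.
            link_el = fromstring(row["title"])
            yield {
                "name": link_el.xpath(".//a/text()")[0],
                "link": response.urljoin(link_el.xpath(".//a/@href")[0]),
            }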
Here is the short code (without using Scrapy):
from requests import get
from json import loads, dumps
from lxml.html import fromstring

url = "https://www.topuniversities.com/sites/default/files/qs-rankings-data/en/2057712.txt"
html = get(url, stream=True)

## another approach for loading json
# jdata = loads(html.content.decode())
jdata = html.json()

for x in jdata['data']:
    core_id = x['core_id']
    country = x['country']
    city = x['city']
    guide = x['guide']
    nid = x['nid']
    title = x['title']
    logo = x['logo']
    score = x['score']
    rank_display = x['rank_display']
    region = x['region']
    stars = x['stars']
    recm = x['recm']
    dagger = x['dagger']

    ## convert title to text
    soup = fromstring(title)
    title = soup.xpath(".//a/text()")[0]
    print(title)
The code above prints the 'title' of each university; try saving it, together with the other available columns, to a CSV/Excel file (a sketch of that follows the sample output below). The results look like this:
Massachusetts Institute of Technology (MIT)
Stanford University
Harvard University
California Institute of Technology (Caltech)
University of Oxford
ETH Zurich - Swiss Federal Institute of Technology
University of Cambridge
Imperial College London
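As a starting point for the CSV suggestion above, here is a minimal sketch using pandas. It assumes the same 2057712.txt endpoint and the same anchor-in-'title' structure; the output file name universities.csv is just an example.

from requests import get
from lxml.html import fromstring
import pandas as pd

url = "https://www.topuniversities.com/sites/default/files/qs-rankings-data/en/2057712.txt"
rows = get(url).json()["data"]

for row in rows:
    # Replace the HTML fragment in 'title' with plain text and keep the link in its own column.
    # Assumes 'title' looks like '<a href="...">University name</a>'.
    link_el = fromstring(row["title"])
    row["link"] = link_el.xpath(".//a/@href")[0]
    row["title"] = link_el.xpath(".//a/text()")[0]

# Keep every available column and write the whole table to disk.
pd.DataFrame(rows).to_csv("universities.csv", index=False)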