Parsing an HTML table gets empty soup with BeautifulSoup and requests

I am trying to get all the tables from this url = "https://www.timeshighereducation.com/world-university-rankings/2020/subject-ranking/life-sciences#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats" into a DataFrame (821 rows in total; I need every table). The code I am using is:

import requests
from bs4 import BeautifulSoup
import json
url = "https://www.timeshighereducation.com/world-university-rankings/2020/subject-ranking/life-sciences#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats"
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
print(soup) # It doesn't print anything 

My idea is to get the information from the soup, find the <script> tag containing jQuery.extend(Drupal.settings, {"basePath": ..., and follow the JSON link https://www.timeshighereducation.com/sites/default/files/the_data_rankings/life_sciences_rankings_2020_0__a2e62a5137c61efeef38fac9fb83a262.json, which is where all the table data lives. I already have a function that reads this JSON link, but first I need to find that information in the soup so I can extract the JSON link. I need this because I have to read many tables, and getting each JSON link by manual inspection is not an option for me.

You need the following regular-expression pattern, which finds the required string after "url":

import requests
import re

with requests.Session() as s:
    # Some sites block the default requests User-Agent, so send a browser-like one
    s.headers = {'User-Agent': 'Mozilla/5.0'}
    r = s.get('https://www.timeshighereducation.com/world-university-rankings/2020/subject-ranking/life-sciences#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats')
    # The JSON link is embedded in the page source as "url":"...", with the
    # forward slashes escaped as \/ — capture it and unescape the slashes
    url = re.search(r'"url":"(.*?)"', r.text).group(1).replace('\\/', '/')
    data = s.get(url).json()
    print(data)
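To get the rows into a DataFrame, the JSON payload can be passed to pandas. This is a sketch under the assumption that the payload contains a top-level "data" list with one record per university (the exact keys on the live site may differ); it uses a small inlined sample with made-up values instead of a live request, so it runs offline:

```python
import json
import pandas as pd

# Hypothetical sample mimicking the assumed shape of the rankings JSON:
# a top-level "data" list holding one dict per university.
sample = json.loads('''
{
  "data": [
    {"rank": "1", "name": "Example University A", "scores_overall": "96.4"},
    {"rank": "2", "name": "Example University B", "scores_overall": "93.5"}
  ]
}
''')

# One row per record, one column per key
df = pd.DataFrame(sample["data"])
print(df)
```

With the real response, `pd.DataFrame(s.get(url).json()["data"])` would replace the inlined sample, assuming the live payload uses the same "data" key.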