Parsing an HTML table returns an empty soup with BeautifulSoup and requests
I am trying to load all the tables from this url = "https://www.timeshighereducation.com/world-university-rankings/2020/subject-ranking/life-sciences#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats"
into a DataFrame (821 rows in total; I need all of them). The code I am using is:
import requests
from bs4 import BeautifulSoup
import json
url = "https://www.timeshighereducation.com/world-university-rankings/2020/subject-ranking/life-sciences#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats"
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
print(soup) # It doesn't print anything
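My idea was to get the information from the soup, then look for the tag <script> jQuery.extend(Drupal.settings, {"basePath": ...
and follow the JSON link https://www.timeshighereducation.com/sites/default/files/the_data_rankings/life_sciences_rankings_2020_0__a2e62a5137c61efeef38fac9fb83a262.json
which holds all the data shown in the table. I already have a function that reads this JSON link, but first I need to find the information in the soup and then extract the JSON link. It has to work this way because I need to read many tables, and finding each JSON link by manual inspection is not an option for me.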
You need the following regular expression pattern, which finds the required string after "url":
import re
import requests

with requests.Session() as s:
    # Send a browser-like User-Agent with every request in this session
    s.headers.update({'User-Agent': 'Mozilla/5.0'})
    r = s.get('https://www.timeshighereducation.com/world-university-rankings/2020/subject-ranking/life-sciences#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats')
    # Capture the value after "url": and turn the JSON-escaped \/ back into /
    url = re.search(r'"url":"(.*?)"', r.text).group(1).replace('\\/', '/')
    # Fetch the JSON file that backs the table
    data = s.get(url).json()
    print(data)
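Since you need to do this for many pages, it may help to wrap the extraction step in a small function that returns None instead of raising when no link is found. The following is only a sketch: the function name extract_json_url and the sample HTML (including its "the_data_rankings" key and the example.com address) are made up for illustration and are not taken from the real page.

```python
import json
import re

# Matches the first "url":"..." pair; the group captures the raw (still escaped) string body.
URL_PATTERN = re.compile(r'"url"\s*:\s*"((?:[^"\\]|\\.)*)"')


def extract_json_url(html):
    """Return the first "url" value embedded in the page source, or None if absent."""
    match = URL_PATTERN.search(html)
    if match is None:
        return None
    # Decode the capture as a JSON string literal so escapes such as \/ become /.
    return json.loads('"' + match.group(1) + '"')


# Hypothetical stand-in for r.text; the real page's settings object will differ.
sample_html = (
    '<script>jQuery.extend(Drupal.settings, {"basePath":"\\/",'
    '"the_data_rankings":{"url":"https:\\/\\/example.com\\/files\\/rankings.json"}});'
    '</script>'
)

print(extract_json_url(sample_html))
```

With the session from the answer you would call extract_json_url(r.text) for each page and skip the ones that return None. If the downloaded JSON turns out to hold its rows as a list of dicts under some top-level key (worth checking by inspecting the response first), passing that list to pandas.DataFrame should give you the table.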