Python: Can't extract tbody information from website

Question

I want to extract all the links from this website: https://pflegefinder.bkk-dachverband.de/pflegeheime/searchresult.php?required=1&statistics=1&searchdata%5BmaxDistance%5D=0&searchdata%5BcareType%5D=inpatientCare#/tab/general

The information I need is stored in the tbody: page code

Every time I try to extract the data, I get no results.

from bs4 import BeautifulSoup
import requests
from requests_html import HTMLSession

url = "https://pflegefinder.bkk-dachverband.de/pflegeheime/searchresult.php?required=1&statistics=1&searchdata%5BmaxDistance%5D=0&searchdata%5BcareType%5D=inpatientCare#complex-searchresult"



session = HTMLSession()
r = session.get(url)
r.html.render()

soup = BeautifulSoup(r.html.html,'html.parser')

print(r.html.search("Details"))

Thanks for your help!

Answer 1

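The site loads its data from a backend API, so the table rows are not in the HTML that the server sends. If you open your browser's developer tools, go to Network > Fetch/XHR, and refresh the page, you will see the data arrive as JSON from a request whose URL is similar to the one you posted.

You can fetch the data like this; the response is JSON, which is easy to parse: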
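To see why the API request looks "similar" to the page URL, it can help to decode the page's query string. The sketch below assumes the API simply takes the same parameters with the `searchdata[...]` wrapper removed, which is what a comparison of the two URLs suggests; the helper name `page_params_to_api_params` is made up for illustration.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

PAGE_URL = (
    "https://pflegefinder.bkk-dachverband.de/pflegeheime/searchresult.php"
    "?required=1&statistics=1&searchdata%5BmaxDistance%5D=0"
    "&searchdata%5BcareType%5D=inpatientCare#/tab/general"
)

def page_params_to_api_params(page_url):
    """Decode the query string and unwrap names like searchdata[x] to x."""
    # urlsplit separates the '#/tab/general' fragment from the query;
    # the fragment is handled by the browser and never sent to the server.
    query = urlsplit(page_url).query
    params = {}
    for name, value in parse_qsl(query):  # parse_qsl also decodes %5B and %5D
        if name.startswith("searchdata[") and name.endswith("]"):
            name = name[len("searchdata["):-1]
        params[name] = value
    return params

api_params = page_params_to_api_params(PAGE_URL)
print(api_params)
print(urlencode(api_params))
```

The resulting names (`required`, `statistics`, `maxDistance`, `careType`) match the ones that appear in the API URL seen in the network tab.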

import requests

headers = {
    'Referer':'https://pflegefinder.bkk-dachverband.de/pflegeheime/searchresult.php?required=1&statistics=1&searchdata%5BmaxDistance%5D=0&searchdata%5BcareType%5D=inpatientCare',
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'
    }

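# Each request returns up to 20 results; range(2) fetches the first two pages.
for page in range(2):
    # offset selects where the page starts: 0, 20, ...
    url = f'https://pflegefinder.bkk-dachverband.de/api/nursing-homes?required=1&statistics=1&maxDistance=0&careType=inpatientCare&limit=20&offset={page*20}'
    resp = requests.get(url, headers=headers).json()
    print(resp)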
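Since the question asks for links and the layout of the JSON is not shown here, one schema-agnostic option is to walk the parsed response and collect every string that looks like a URL. This is only a sketch: `collect_links` is a hypothetical helper, and the sample data is made up and does not reflect the real field names.

```python
def collect_links(node, found=None):
    """Recursively gather every string value that starts with http:// or https://."""
    if found is None:
        found = []
    if isinstance(node, dict):
        for value in node.values():
            collect_links(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_links(item, found)
    elif isinstance(node, str) and node.startswith(("http://", "https://")):
        found.append(node)
    return found

# Made-up sample; replace it with the `resp` object returned by .json().
sample = {
    "items": [
        {"name": "A", "website": "https://example.com/a"},
        {"name": "B", "contact": {"url": "http://example.org/b"}},
    ],
    "total": 2,
}
links = collect_links(sample)
print(links)
```

Once you have inspected the real response, it is usually cleaner to read the specific keys you need instead of scanning everything.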

The API checks that the request has a "Referer" header; without it you get a 400 response.
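The Referer requirement and the paging can be wrapped in a small generator. This is a minimal sketch, not the site's documented interface: `build_params` and `fetch_pages` are hypothetical names, the parameter values are copied from the request above, and `get` is expected to be `requests.get` or anything with the same call shape.

```python
API_URL = "https://pflegefinder.bkk-dachverband.de/api/nursing-homes"

HEADERS = {
    # Same Referer as in the request above; the API is reported to answer 400 without it.
    "Referer": "https://pflegefinder.bkk-dachverband.de/pflegeheime/searchresult.php?required=1&statistics=1&searchdata%5BmaxDistance%5D=0&searchdata%5BcareType%5D=inpatientCare",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36",
}

def build_params(page, limit=20):
    """Query parameters for one page; offset advances by `limit` per page."""
    return {
        "required": 1,
        "statistics": 1,
        "maxDistance": 0,
        "careType": "inpatientCare",
        "limit": limit,
        "offset": page * limit,
    }

def fetch_pages(get, page_count, limit=20):
    """Yield the parsed JSON of each page, failing loudly on a 400."""
    for page in range(page_count):
        resp = get(API_URL, params=build_params(page, limit),
                   headers=HEADERS, timeout=30)
        if resp.status_code == 400:
            raise RuntimeError("400 response; check that the Referer header is sent")
        resp.raise_for_status()  # surface any other HTTP error
        yield resp.json()

# Usage (performs real network requests):
#   import requests
#   for data in fetch_pages(requests.get, page_count=2):
#       print(data)
```

Passing `get` in as an argument keeps the paging logic separate from the HTTP library, so it can be exercised with a stub before hitting the real site.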

python-3.x

web-scraping

beautifulsoup

html-tbody