python html JS函数的抓取结果

Question

我正在尝试抓取执行了 JS 脚本的网页。

我可以使用 from requests import get

使用 none 执行的 JS 获得 HTML

但是我无法像用mozilla检查网页时那样得到JS函数结果

这里是我想从中获取结果的函数：

function showFlight(idPilote, idFlight, idActivite)

on this page

知道怎么做吗？

我试过了

def kmlScrape(target):
    curl = pycurl.Curl()
    curl.setopt(pycurl.CAINFO, certifi.where())
    curl.setopt(pycurl.SSL_VERIFYPEER, 0) 
    curl.setopt(pycurl.URL, target)
    curl.setopt(pycurl.WRITEFUNCTION, e.write)
    curl.perform()
    return e

或

response = get(target)
print(response.text)
html_soup = BeautifulSoup(response.text, 'html.parser')

我想做的是以 pycurl 格式编写此 curl 命令：

curl "https://www.syride.com/scripts/ajx_vols.php?p=/en/flights/&idPays=0&pseudo=0&typePratique=0&page=01&idSpot=0&recherche=&order=&tri=&l=en" -H "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0" -H "Accept: */*" -H "Accept-Language: en-US,en;q=0.5" --compressed -H "X-Requested-With: XMLHttpRequest" -H "DNT: 1" -H "Connection: keep-alive" -H "Referer: https://www.syride.com/en/flights/&idPays=0&pseudo=0&typePratique=0&page=01&idSpot=0&recherche=&order=&tri=" -H "Cookie: instruments2=5ai6jgg9p6bpkh3m7slf0b2vq7; gb__ssyride=aufnlupslr259o49nc98qm1t55" -H "Cache-Control: max-age=0"

非常欢迎任何帮助。

谢谢，垫子

Answer 1

我找到了答案：

def kmlScrape(target):
    curl = pycurl.Curl()
    curl.setopt(pycurl.CAINFO, certifi.where())
    curl.setopt(pycurl.SSL_VERIFYPEER, 0) 
    curl.setopt(pycurl.URL, target)
    opts={pycurl.USERAGENT:"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0", pycurl.HTTPHEADER:["Accept: */*", "Accept-Language: en-US,en;q=0.5", "Connection: keep-alive", "X-Requested-With: XMLHttpRequest", "DNT: 1", "Referer: https://www.syride.com/en/flights/&idPays=0&pseudo=0&typePratique=0&page=01&idSpot=0&recherche=&order=&tri=", "Cookie: instruments2=5ai6jgg9p6bpkh3m7slf0b2vq7; gb__ssyride=aufnlupslr259o49nc98qm1t55", "Cache-Control: max-age=0"]}
    for (key,value) in opts.items():
            curl.setopt(key,value)
    curl.setopt(pycurl.WRITEFUNCTION, e.write)
    curl.perform()
    return e

Answer 2

这是对 Back-End API 的直接调用。

import requests
import pandas as pd


params = {
    'l': 'en'
}


def main(url, params):
    with requests.Session() as req:
        allin = []
        for page in range(1, 41):
            print(F"Extracting Page# {page}")
            params['page'] = page
            r = req.get(url, params=params)
            df = pd.read_html(r.content, encoding="UTF-8",
                              header=0)[0][:-1]
            allin.append(df)
        new = pd.concat(allin)
        new.to_csv("data.csv", index=False)
        print("Done")


main("https://www.syride.com/scripts/ajx_vols.php", params)

python html JS函数的抓取结果

python html scraping result of JS function

python

pycurl

web-scraping

python-requests