Login authentication and downloading multiple files using Python

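I am trying to download multiple files from a website. Each file has a different link, and the links are stored in a txt file. A user must be logged in to download the files, and I need to download more than 10,000 of them.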

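Is there a better way to supply the login credentials so that I authenticate only once and then download the files iteratively, as in the code below?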

import wget
# import requests

# Read one link per line and strip the trailing newline that readlines() would keep
with open("datalinks.txt", "r") as f:
    lnks = [line.strip() for line in f]

for eachlink in lnks:
    if '.h5' in eachlink:
        wget.download(eachlink)

Without knowing the site and how its login process works, it is hard to give you specific help.

Some general advice: use requests.Session(), which keeps the session open while you scrape so that cookies persist from one request to the next. Here is an example:

import requests

s = requests.Session()

login_url = 'https://www.website.com/login'
username = 'joe'
password = 'hunter2'

payload = {'username': username,
            'password': password}

header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}

entry = s.post(login_url,headers=header,data=payload)
print(entry.status_code)

### continue scraping with s Session
data = s.get('other_url')
print(data.content)
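
To tie this back to the question, the loop over the link file could reuse the logged-in session along these lines. This is only a minimal sketch: `read_links` and `download_all` are hypothetical helper names, the `downloads` directory and the chunk size are arbitrary choices, and it assumes the site serves the files to any request that carries the session cookies.

```python
import os
from urllib.parse import urlsplit


def read_links(path, marker=".h5"):
    """Return the stripped, non-empty lines of `path` that contain `marker`."""
    with open(path, "r") as f:
        links = [line.strip() for line in f]
    return [link for link in links if link and marker in link]


def download_all(session, links, out_dir="downloads", chunk_size=1 << 20):
    """Fetch every link through the same session and return the saved paths.

    `session` is expected to behave like a logged-in requests.Session.
    """
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for link in links:
        # Name the local file after the last component of the URL path
        filename = os.path.basename(urlsplit(link).path)
        target = os.path.join(out_dir, filename)
        if os.path.exists(target):
            continue  # already downloaded in an earlier run
        partial = target + ".part"
        # stream=True avoids holding a whole file in memory
        with session.get(link, stream=True, timeout=60) as response:
            response.raise_for_status()
            with open(partial, "wb") as out:
                for chunk in response.iter_content(chunk_size=chunk_size):
                    out.write(chunk)
        # Rename only after a complete download, so an interrupted run
        # never leaves a truncated file under the final name
        os.replace(partial, target)
        saved.append(target)
    return saved
```

With the session `s` from the example above, a call such as `download_all(s, read_links("datalinks.txt"))` would log in once and then fetch each file with the same cookies. Skipping files that already exist makes it possible to rerun the script after an interruption, which matters with more than 10,000 downloads.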

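Update: now that you have provided the URL, it turns out to be more complicated than I expected! You first need to obtain the login token, which is loaded by JavaScript. I chose to use Selenium for that: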

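import requests
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import time

url = 'https://gportal.jaxa.jp/gpr/auth'

# Selenium 4 takes the driver path through a Service object
# (the older executable_path= argument was removed in later 4.x releases)
driver = webdriver.Chrome(service=Service(r"path_to_your_chrome_driver")) #https://chromedriver.chromium.org/downloads
driver.get(url)
time.sleep(2)  # give the page's JavaScript time to load the login token

cookies = driver.get_cookies()

driver.quit()  # quit() closes every window and ends the driver process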

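# Build a Cookie header string and pick out the CSRF token along the way
cookie_str = ''
fuel_csrf_token = None
for cookie in cookies:
    name = cookie['name']
    value = cookie['value']
    if 'csrf_token' in name:
        fuel_csrf_token = value
    cookie_str += f'{name}={value}; '

if fuel_csrf_token is None:
    raise RuntimeError('no csrf_token cookie found; the page may need longer to load')

print(cookie_str)
print(fuel_csrf_token)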

s = requests.Session()

username = 'name'
pw = 'password'

payload = {
    'account':username,
    'password':pw,
    'fuel_csrf_token': fuel_csrf_token
    }

headers = {
    'Accept':'application/json, text/javascript, */*; q=0.01',
    'Accept-Encoding':'gzip, deflate, br',
    'Content-Type':'application/x-www-form-urlencoded; charset=UTF-8',
    'Cookie': cookie_str,
    'Host':'gportal.jaxa.jp',
    'Origin':'https://gportal.jaxa.jp',
    'Referer':'https://gportal.jaxa.jp/gpr/auth',
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36',
    'X-Requested-With':'XMLHttpRequest'
    }

login_url = 'https://gportal.jaxa.jp/gpr/auth/authenticate.json'

login = s.post(login_url,headers=headers,data=payload)
print(login)
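
As an alternative to assembling the 'Cookie' header by hand, the cookies collected by Selenium could be copied into the session's own cookie jar, so that later requests made with the session send them automatically. The helper below is a sketch under that assumption: `copy_cookies` is a hypothetical name, and whether this site accepts the login when the cookies come from the jar instead of an explicit header has not been verified.

```python
def copy_cookies(session, selenium_cookies, token_key="csrf_token"):
    """Copy Selenium cookie dicts into session.cookies.

    Returns the value of the first-matching-last cookie whose name contains
    `token_key`, or None if no such cookie is present.
    """
    token = None
    for cookie in selenium_cookies:
        # RequestsCookieJar.set accepts domain and path as keyword arguments
        session.cookies.set(
            cookie["name"],
            cookie["value"],
            domain=cookie.get("domain", ""),
            path=cookie.get("path", "/"),
        )
        if token_key in cookie["name"]:
            token = cookie["value"]
    return token
```

Called as `fuel_csrf_token = copy_cookies(s, cookies)` right after `s = requests.Session()`, this would also keep the cookies available for the file downloads that follow the login, since they live in the session and not in a one-off header.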