Login authentication and downloading multiple files using Python
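I am trying to download multiple files from a website. Each file has a different link, and the links are stored in a txt file. Downloading requires the user to be logged in, and I need to download more than 10,000 files.
Is there a better way to supply the login credentials so that I authenticate only once and then download the files iteratively, as in the code below?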
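import wget
# import requests
# Read the list of links; "with" closes the file automatically.
with open("datalinks.txt", "r") as f:
    lnks = f.readlines()
for eachlink in lnks:
    if '.h5' in eachlink:
        file_url = eachlink.strip()  # drop the trailing newline left by readlines()
        wget.download(file_url)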
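Without knowing the site and how its login process works, it is hard to give specific help.
A general suggestion is to use requests.Session(), which keeps the session open while you scrape and persists cookies across requests. Here is an example: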
import requests
s = requests.Session()
login_url = 'https://www.website.com/login'
username = 'joe'
password = 'hunter2'
payload = {'username': username,
'password': password}
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}
entry = s.post(login_url,headers=header,data=payload)
print(entry.status_code)
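# Continue scraping with the same session; cookies set at login are sent automatically.
data = s.get('other_url')  # placeholder: replace with a real URL behind the login
print(data.content)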
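Once the login POST succeeds, the same session can be reused for every download, so authentication happens only once. Below is a minimal sketch of that loop, assuming the entries in datalinks.txt are direct file URLs and that the site accepts the session's cookies for downloads. The helper names read_h5_links, download_file and download_all are made up for illustration, and `session` is expected to be a logged-in requests.Session such as `s` above.

```python
import os
from urllib.parse import urlsplit


def read_h5_links(list_path):
    """Return the stripped lines of list_path that contain '.h5'."""
    with open(list_path, "r") as f:
        return [line.strip() for line in f if ".h5" in line]


def download_file(session, url, dest_dir=".", chunk_size=1 << 20):
    """Stream one URL into dest_dir using an already-authenticated session."""
    filename = os.path.basename(urlsplit(url).path) or "download.h5"
    target = os.path.join(dest_dir, filename)
    if os.path.exists(target):
        return target  # already fetched on an earlier run, so skip it
    resp = session.get(url, stream=True, timeout=60)
    resp.raise_for_status()  # fail loudly on 401/403/404 instead of saving an error page
    part = target + ".part"
    with open(part, "wb") as out:
        for chunk in resp.iter_content(chunk_size=chunk_size):
            out.write(chunk)
    os.replace(part, target)  # rename only after the whole file was written
    return target


def download_all(session, links, dest_dir="."):
    """Download every link and return a list of (url, exception) for failures."""
    os.makedirs(dest_dir, exist_ok=True)
    failed = []
    for url in links:
        try:
            download_file(session, url, dest_dir)
        except Exception as exc:  # broad on purpose in this sketch; narrow it in real code
            failed.append((url, exc))
    return failed


# Usage, with s being the logged-in session:
# failed = download_all(s, read_h5_links("datalinks.txt"), "downloads")
```

Streaming with iter_content keeps memory use flat for large .h5 files, and skipping files that already exist lets an interrupted run over 10,000 links be restarted without downloading everything again.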
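Update: now that you have provided the URL, it is more complicated than I expected! You first need to obtain the login token, which is loaded by JavaScript. I chose to use Selenium for that: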
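import requests
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import time
url = 'https://gportal.jaxa.jp/gpr/auth'
# Selenium 4 takes the driver path through a Service object; older releases used
# webdriver.Chrome(executable_path=...). https://chromedriver.chromium.org/downloads
driver = webdriver.Chrome(service=Service(r"path_to_your_chrome_driver"))
driver.get(url)
time.sleep(2)  # give the page's JavaScript time to set the token cookie
cookies = driver.get_cookies()
driver.quit()  # quit() closes the window and ends the driver session
# Build a Cookie header string and pick out the CSRF token.
cookie_str = ''
fuel_csrf_token = None
for cookie in cookies:
    name = cookie['name']
    value = cookie['value']
    if 'csrf_token' in name:
        fuel_csrf_token = value
    cookie_str += f'{name}={value}; '
print(cookie_str)
print(fuel_csrf_token)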
s = requests.Session()
username = 'name'
pw = 'password'
payload = {
    'account': username,
    'password': pw,
    'fuel_csrf_token': fuel_csrf_token
}
headers = {
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Accept-Encoding': 'gzip, deflate, br',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Cookie': cookie_str,
    'Host': 'gportal.jaxa.jp',
    'Origin': 'https://gportal.jaxa.jp',
    'Referer': 'https://gportal.jaxa.jp/gpr/auth',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest'
}
login_url = 'https://gportal.jaxa.jp/gpr/auth/authenticate.json'
login = s.post(login_url,headers=headers,data=payload)
print(login)
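The code above sends the Selenium cookies only in the headers of the login request. If later download requests also need them (an assumption; it depends on how the site tracks the login), one option is to store them in the session's cookie jar so that every request made through `s` carries them. A sketch with a hypothetical helper, copy_selenium_cookies, which relies on the cookie jar's set() method accepting domain and path keyword arguments:

```python
def copy_selenium_cookies(session, selenium_cookies):
    """Store cookie dicts from driver.get_cookies() in the session's cookie jar."""
    for cookie in selenium_cookies:
        session.cookies.set(
            cookie["name"],
            cookie["value"],
            domain=cookie.get("domain", ""),  # empty string when Selenium gave no domain
            path=cookie.get("path", "/"),
        )
    return session


# Usage, with s and cookies as defined above:
# copy_selenium_cookies(s, cookies)
# Later s.get(...) calls that do not set their own Cookie header then send these cookies.
```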