How can I access a PDF file with Python through an automatic download link?
I am trying to write an automated Python script that goes to a web page like this one, finds the link at the bottom of the body text (anchor text "here"), and downloads the PDF that loads after clicking said download link. I am able to retrieve the HTML from the original page and find the download link, but I don't know how to get the link to the PDF from there. Any help would be much appreciated. This is what I have so far:
import urllib3
from urllib.request import urlopen
from bs4 import BeautifulSoup
# Open page and locate href for bill text
url = 'https://www.murphy.senate.gov/newsroom/press-releases/murphy-blumenthal-introduce-legislation-to-create-a-national-green-bank-thousands-of-clean-energy-jobs'
html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')
links = []
for link in soup.findAll('a', href=True, text=['HERE', 'here', 'Here']):
    links.append(link.get('href'))
links2 = [x for x in links if x is not None]
# Open download link to get PDF
html = urlopen(links2[0])
soup = BeautifulSoup(html, 'html.parser')
links = []
for link in soup.findAll('a'):
    links.append(link.get('href'))
links2 = [x for x in links if x is not None]
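At this point the list of links I get does not include the PDF I am looking for. Is there any way to get it without hard-coding the link to the PDF in the code (which would defeat the purpose of what I am trying to do here)? Thanks!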
Find the a element with the text here, then follow the trail.
import requests
from bs4 import BeautifulSoup
url = 'https://www.murphy.senate.gov/newsroom/press-releases/murphy-blumenthal-introduce-legislation-to-create-a-national-green-bank-thousands-of-clean-energy-jobs'
user_agent = {'User-agent': 'Mozilla/5.0'}
s = requests.Session()
r = s.get(url, headers=user_agent)
soup = BeautifulSoup(r.content, 'html.parser')
for a in soup.select('a'):
    if a.text == 'here':
        href = a['href']
r = s.get(href, headers=user_agent)
print(r.status_code, r.reason)
print(r.headers)
_, dl_url = r.headers['refresh'].split('url=', 1)
r = s.get(dl_url, headers=user_agent)
print(r.status_code, r.reason)
print(r.headers)
file_bytes = r.content # here's your PDF; you can write it out to a file
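The `split('url=', 1)` step above assumes the Refresh header is written exactly as `<seconds>;url=<target>` with a lowercase `url=` and an absolute URL. A slightly more forgiving sketch is shown below; the helper names `parse_refresh_url` and `save_pdf` are hypothetical (not part of requests or BeautifulSoup), and the handling of quoted or relative targets is an assumption about what a server might send, not something observed on this site.

```python
import re
from pathlib import Path
from typing import Optional
from urllib.parse import urljoin


def parse_refresh_url(refresh_value: str, base_url: str = "") -> Optional[str]:
    """Extract the target URL from a Refresh header value such as '0;url=/file.pdf'.

    Returns None when the value carries no 'url=' part (for example just '5').
    """
    match = re.search(r"url\s*=\s*(.+)", refresh_value, flags=re.IGNORECASE)
    if match is None:
        return None
    # Some servers wrap the target in quotes; strip them and any whitespace.
    target = match.group(1).strip().strip("'\"")
    # Resolve a relative target against the page that sent the header.
    return urljoin(base_url, target) if base_url else target


def save_pdf(file_bytes: bytes, path: str) -> Path:
    """Write the downloaded bytes to disk after a basic PDF signature check."""
    if not file_bytes.startswith(b"%PDF"):
        raise ValueError("response body does not look like a PDF")
    out = Path(path)
    out.write_bytes(file_bytes)
    return out
```

With the session code above, this would be used roughly as `dl_url = parse_refresh_url(r.headers['refresh'], r.url)` followed, after the final request, by `save_pdf(r.content, 'bill.pdf')`, where `bill.pdf` is just a placeholder filename.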