Getting bytes of a webpage from selenium
I am trying to scrape a webpage that contains a PDF.
With requests I use the following code to get the bytes and save them with open():
import requests

pdf_response = requests.get(pdf_url)
with open("sample.pdf", 'wb') as f:
    f.write(pdf_response.content)  # the with block closes the file automatically
And it works fine. But for the webpage below I used selenium, and I cannot get the bytes from the response object to use with the code above:
# This does not return a bytes object the way requests does
from selenium import webdriver

driver = webdriver.Chrome()
driver.get(base)
content = driver.page_source.encode('utf-8').strip()
Link to the PDF (this page has a captcha that I solve with 2captcha)
The current response I get:
''
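For reference, a minimal sketch of one common workaround (not from the original post; it assumes the direct PDF URL is already known and that the server only needs the same session cookies): copy the cookies from the Selenium driver into a requests.Session and download the bytes there.

# Sketch: reuse the Selenium session in requests.
# `base` and `pdf_url` are placeholders taken from the snippets above.
import requests
from selenium import webdriver

driver = webdriver.Chrome()
driver.get(base)

s = requests.Session()
for cookie in driver.get_cookies():
    s.cookies.set(cookie['name'], cookie['value'])  # hand the browser cookies to requests

pdf_response = s.get(pdf_url)
with open("sample.pdf", 'wb') as f:
    f.write(pdf_response.content)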
I could get the PDF using only requests.
The only problem: I use pillow to build one image with the full code and display it, and I have to read this code manually. But if you have some way to recognize it automatically, then there is no problem.
import requests
import lxml.html
from PIL import Image
import io
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0',
}
# --- create Session ---
s = requests.Session()
s.headers.update(headers)
# --- load main page ---
url = 'https://www.sedar.com/GetFile.do?lang=EN&docClass=8&issuerNo=00028264&issuerType=03&projectNo=03079934&docId=4755532' # JSON
r = s.get(url)
# --- get images ---
soup = lxml.html.fromstring(r.text)
image_urls = soup.xpath('//img/@src')
# --- generate one image ---
full_image = Image.new('RGB', (40*5, 50))
for i, url in enumerate(image_urls):
    #print(url)
    r = s.get('https://www.sedar.com/' + url)
    image = Image.open(io.BytesIO(r.content))
    full_image.paste(image, (40*i, 0))
# --- ask for code ---
full_image.show()
code = input('code> ')
#print('code:', code)
# --- get PDF ---
r = s.post('https://www.sedar.com/CheckCode.do', data={'code': code})
if r.headers['Content-Type'] != 'application/pdf':
    print('It is not a PDF file')
else:
    with open('output.pdf', 'wb') as fh:
        print('size:', fh.write(r.content))
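Since the question mentions 2captcha, the manual input() step could be automated by sending the assembled image to their service. A minimal sketch, assuming the 2captcha-python client package and a valid API key (both are assumptions, not part of the original answer); the rest of the script, including the CheckCode.do POST, stays the same:

# --- sketch: automate the code recognition with 2captcha ---
# Assumes `pip install 2captcha-python` and a valid API key.
from twocaptcha import TwoCaptcha

solver = TwoCaptcha('YOUR_API_KEY')     # placeholder key
full_image.save('captcha.png')          # save the assembled image built above
result = solver.normal('captcha.png')   # submit the image as a normal captcha
code = result['code']                   # solved text, used in place of input()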