How to parse a data URI in Python?
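HTML image elements have this simplified format:
<img src='something'>
That something can be a data URI, for example:
...
Is there a standard way to parse this with Python, so that I can separate the content_type from the base64 data, or should I write my own parser for this?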
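Split the data URI on the first comma to get the base64-encoded data without the header. Call base64.b64decode to decode it to bytes. Finally, write the bytes to a file.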
from base64 import b64decode

data_uri = "..."

# Python 2 and Python < 3.4
header, encoded = data_uri.split(",", 1)
data = b64decode(encoded)

# Python 3.4+
# from urllib import request
# with request.urlopen(data_uri) as response:
#     data = response.read()

with open("image.png", "wb") as f:
    f.write(data)
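The split above throws the header away, while the question also asks for the content type. A minimal Python 3 sketch that keeps it could look like the following. The helper name split_data_uri and the sample URI are made up for illustration, and the sketch only handles the common `data:<media type>[;base64],<payload>` shape.

```python
from base64 import b64decode
from urllib.parse import unquote_to_bytes


def split_data_uri(data_uri):
    """Split a data URI into (media_type, data_bytes).

    split_data_uri is a hypothetical helper, not a standard-library function.
    """
    if not data_uri.startswith("data:"):
        raise ValueError("not a data URI")
    # Everything before the first comma is the header, the rest is the payload.
    header, payload = data_uri[len("data:"):].split(",", 1)
    is_base64 = header.endswith(";base64")
    if is_base64:
        header = header[: -len(";base64")]
    # RFC 2397 default when the media type is omitted.
    media_type = header or "text/plain;charset=US-ASCII"
    # Undo percent-encoding first, then base64 if the header asked for it.
    raw = unquote_to_bytes(payload)
    data = b64decode(raw) if is_base64 else raw
    return media_type, data


# Sample URI assumed for illustration: base64 of the ASCII text "Hello".
media_type, data = split_data_uri("data:text/plain;base64,SGVsbG8=")
print(media_type, data)
```

Any parameters such as charset stay attached to the returned media type string, so callers that need only the bare type would still have to split on ";".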
This may help:
import re
from base64 import b64decode
from lxml import html

BASE_NAME = "image_"
source_code = """<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUA
AAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO
9TXL0Y4OHwAAAABJRU5ErkJggg==" alt="Red dot" />
<img src="data:image/gif;base64,R0lGODlhAQABAIAAAAUEBAAAACwAAAAAAQABAAACAkQBADs=" alt="Black dot" />"""

tree = html.fromstring(source_code)
# Select the src attribute of every <img> whose source is an image data URI.
for i, image in enumerate(tree.xpath('//img[contains(@src, "data:image")]/@src')):
    image_type, image_content = image.split(',', 1)
    image_type = re.findall(r'data:image/(\w+);base64', image_type)[0]
    with open("{}{}.{}".format(BASE_NAME, i, image_type), "wb") as f:
        # b64decode works on Python 2 and 3; str.decode('base64') is Python 2 only.
        f.write(b64decode(image_content))
    print("[*] '{}' image found with content: {}\n".format(image_type, image_content))
Output:
[*] 'png' image found with content: iVBORw0KGgoAAAANSUhEUgAAAAUA
AAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO
9TXL0Y4OHwAAAABJRU5ErkJggg==
[*] 'gif' image found with content: R0lGODlhAQABAIAAAAUEBAAAACwAAAAAAQABAAACAkQBADs=
It saves every base64 image found in an <img> tag, each with its own file extension. The file names have the form BASE_NAME + enumerate index + image_extension.
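Taking the extension from the media type with a regular expression works for simple subtypes like png and gif, but `\w+` would not match a subtype such as svg+xml. One possible alternative, sketched here for Python 3, is to ask the standard mimetypes module. The helper name extension_for and the ".bin" fallback are assumptions made for this example.

```python
import mimetypes


def extension_for(media_type, default=".bin"):
    """Return a file extension for a media type such as 'image/png'.

    extension_for is a hypothetical helper; anything after ';' is dropped.
    """
    base_type = media_type.split(";", 1)[0].strip().lower()
    # guess_extension returns None for types it does not know.
    return mimetypes.guess_extension(base_type) or default


png_ext = extension_for("image/png")
unknown_ext = extension_for("application/x-not-a-real-type")
print(png_ext, unknown_ext)
```

The result for some types depends on the Python version and the system's MIME tables, so it is worth checking the extensions you actually care about.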
A correction to JRodDynamite's post:
from base64 import b64decode  # decodestring was removed in Python 3.9

png_arr = "..."
# Keep only the part after the first comma (the base64 payload).
png_arr = png_arr.split(",", 1)[1]
with open("imageToSave.png", "wb") as fh:
    fh.write(b64decode(png_arr))
w3lib (a library used by Scrapy) has a function to parse data URIs:
>>> from w3lib.url import parse_data_uri
>>> parse_data_uri('data:image/png;base64,...')
ParseDataURIResult(media_type='image/png', media_type_parameters={}, data=b'\x89PNG\r\n\x1a')
Python has supported data URIs since version 3.4, using urllib.request.DataHandler behind the scenes.
from urllib.request import urlopen

with urlopen(data_uri) as response:
    data = response.read()
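The response object also carries the media type, which covers the "separate the content type from the data" part of the question. A small sketch for Python 3.4+, using a sample URI chosen purely for illustration:

```python
from urllib.request import urlopen

# Sample URI assumed for illustration: base64 of the ASCII text "Hello".
data_uri = "data:text/plain;base64,SGVsbG8="

with urlopen(data_uri) as response:
    # The headers are synthesized from the URI itself.
    content_type = response.headers.get_content_type()
    data = response.read()

print(content_type, data)
```

Since urlopen will also fetch http(s) and file URLs, it is probably wise to check that the string starts with "data:" before passing untrusted input to it.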
from urllib import request

def download(data_uri, name):
    # urlopen handles both regular URLs and data URIs (Python 3.4+).
    with request.urlopen(data_uri) as response:
        data = response.read()
    with open(name, "wb") as f:
        f.write(data)

en = "https://encrypted-tbn0.gstatic.com/images..."
src = "data:image/png;base64,..."
download(en, "en")
download(src, "src")