Python email package: how to reliably convert/decode multipart messages to str
I am trying to process old, possibly non-compliant email with Python. I can read a message without any problem:
In [1]: m=email.message_from_binary_file(open('/path/to/problematic:2,S',mode='rb'))
But converting it to a string then fails with UnicodeEncodeError: 'gb2312' codec can't encode character '\ufffd' in position 1238: illegal multibyte sequence. The (multi)part of the problematic message has 'Content-Type: text/plain; charset="gb2312"' and 'Content-Transfer-Encoding: 8bit'.
In [2]: m.as_string()
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-26-919a3a20e7d8> in <module>()
----> 1 m.as_string()
~/tools/conda/envs/conda3.6/lib/python3.6/email/message.py in as_string(self, unixfrom, maxheaderlen, policy)
156 maxheaderlen=maxheaderlen,
157 policy=policy)
--> 158 g.flatten(self, unixfrom=unixfrom)
159 return fp.getvalue()
160
~/tools/conda/envs/conda3.6/lib/python3.6/email/generator.py in flatten(self, msg, unixfrom, linesep)
114 ufrom = 'From nobody ' + time.ctime(time.time())
115 self.write(ufrom + self._NL)
--> 116 self._write(msg)
117 finally:
118 self.policy = old_gen_policy
~/tools/conda/envs/conda3.6/lib/python3.6/email/generator.py in _write(self, msg)
179 self._munge_cte = None
180 self._fp = sfp = self._new_buffer()
--> 181 self._dispatch(msg)
182 finally:
183 self._fp = oldfp
~/tools/conda/envs/conda3.6/lib/python3.6/email/generator.py in _dispatch(self, msg)
212 if meth is None:
213 meth = self._writeBody
--> 214 meth(msg)
215
216 #
~/tools/conda/envs/conda3.6/lib/python3.6/email/generator.py in _handle_multipart(self, msg)
270 s = self._new_buffer()
271 g = self.clone(s)
--> 272 g.flatten(part, unixfrom=False, linesep=self._NL)
273 msgtexts.append(s.getvalue())
274 # BAW: What about boundaries that are wrapped in double-quotes?
~/tools/conda/envs/conda3.6/lib/python3.6/email/generator.py in flatten(self, msg, unixfrom, linesep)
114 ufrom = 'From nobody ' + time.ctime(time.time())
115 self.write(ufrom + self._NL)
--> 116 self._write(msg)
117 finally:
118 self.policy = old_gen_policy
~/tools/conda/envs/conda3.6/lib/python3.6/email/generator.py in _write(self, msg)
179 self._munge_cte = None
180 self._fp = sfp = self._new_buffer()
--> 181 self._dispatch(msg)
182 finally:
183 self._fp = oldfp
~/tools/conda/envs/conda3.6/lib/python3.6/email/generator.py in _dispatch(self, msg)
212 if meth is None:
213 meth = self._writeBody
--> 214 meth(msg)
215
216 #
~/tools/conda/envs/conda3.6/lib/python3.6/email/generator.py in _handle_text(self, msg)
241 msg = deepcopy(msg)
242 del msg['content-transfer-encoding']
--> 243 msg.set_payload(payload, charset)
244 payload = msg.get_payload()
245 self._munge_cte = (msg['content-transfer-encoding'],
~/tools/conda/envs/conda3.6/lib/python3.6/email/message.py in set_payload(self, payload, charset)
313 if not isinstance(charset, Charset):
314 charset = Charset(charset)
--> 315 payload = payload.encode(charset.output_charset)
316 if hasattr(payload, 'decode'):
317 self._payload = payload.decode('ascii', 'surrogateescape')
UnicodeEncodeError: 'gb2312' codec can't encode character '\ufffd' in position 1238: illegal multibyte sequence
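I am not very familiar with the peculiarities of email internals. Searching online for this kind of error mostly turns up problems people hit while scraping the web, and the answers basically state the obvious: the raw bytes that were read in contain Unicode characters that cannot be encoded with the target codec.
My question is: what is the proper way to robustly handle (potentially non-compliant) email?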
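Edit
Interestingly, m.get_payload(i=0).as_string() triggers the same exception, but m.get_payload(i=0).get_payload(decode=False) gives a str that displays correctly in my terminal, while m.get_payload(i=0).get_payload(decode=True) gives bytes (b'\xd7\xaa...') that I cannot decode. However, the error then occurs at a different character: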
----> 1 m.get_payload(i=0).get_payload(decode=True).decode('gb2312')
UnicodeDecodeError: 'gb2312' codec can't decode byte 0xac in position 1995: illegal multibyte sequence
or
----> 1 m.get_payload(i=0).get_payload(decode=True).decode('gb18030')
UnicodeDecodeError: 'gb18030' codec can't decode byte 0xa3 in position 2033: illegal multibyte sequence
The short answer is usually error handlers in your bytes.decode calls. But the details depend on a lot of things.
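First, what do you want to do with the data? Often you need something fully reversible, so that in the worst case you can regenerate exactly what you received; in that case you probably want surrogateescape. In other cases you want to produce something human-readable, and it is better to skip impossible mojibake than to try to render it, so ignore may be the right answer. And so on.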
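As a small illustration of that trade-off, here is a sketch on a made-up byte string (two GB2312 characters with a stray 0xff between them; the sample data is my own, not taken from the message in the question). It puts the reversible handler next to the two lossy ones:

```python
# Hypothetical sample: valid GB2312 text with one stray byte in the middle.
raw = "转".encode("gb2312") + b"\xff" + "发".encode("gb2312")

# Reversible: the bad byte is carried through as a lone surrogate (U+DCFF),
# and encoding with the same handler restores the original bytes.
reversible = raw.decode("gb2312", errors="surrogateescape")
restored = reversible.encode("gb2312", errors="surrogateescape")

# Lossy, human-readable variants.
skipped = raw.decode("gb2312", errors="ignore")   # bad byte dropped
marked = raw.decode("gb2312", errors="replace")   # bad byte shown as U+FFFD
```

Note that a str produced with surrogateescape cannot be printed or encoded as UTF-8 directly, so it is mainly useful when you intend to write the bytes back out later.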
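Second, is this a case where the vast majority of messages are fine and only a handful are broken, or one where most messages are mostly fine but contain a few errors each?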
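Finally, in some cases (this is especially true of the legacy Chinese encodings), the real problem is simply that someone declared a closely related charset rather than the one they actually used. If that is what you are seeing, you may want to write explicit fallback code: if you get an exception, look the encoding up in a dictionary of common mistakes and try the alternative encodings. If none of them works, fall back to the declared encoding with an error handler.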
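The fallback idea could look roughly like this. The function name decode_with_fallback and the contents of the FALLBACKS table are my own assumptions for illustration; which mislabelings are actually common depends on your corpus.

```python
# Hypothetical table: declared charset -> related charsets worth trying next.
FALLBACKS = {
    "gb2312": ["gbk", "gb18030"],
    "iso-8859-1": ["cp1252"],
    "us-ascii": ["utf-8"],
    "ascii": ["utf-8"],
}


def decode_with_fallback(data, declared, errors="replace"):
    """Decode bytes strictly with the declared charset or a known relative.

    If every strict attempt fails, decode with the declared charset and the
    given error handler, so that a str is always returned.
    """
    declared = (declared or "ascii").lower()
    for enc in [declared] + FALLBACKS.get(declared, []):
        try:
            return data.decode(enc)
        except (UnicodeDecodeError, LookupError):
            continue
    try:
        return data.decode(declared, errors=errors)
    except LookupError:
        # The declared charset is unknown to Python; use ASCII as a last resort.
        return data.decode("ascii", errors=errors)


# Usage: bytes that are really GBK but labelled gb2312.
text = decode_with_fallback("镕".encode("gbk"), "gb2312")
```

Pass errors="surrogateescape" instead of the default if you need the result to be reversible rather than readable.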
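Apparently, if Content-Transfer-Encoding is 8bit, message.get_payload(decode=False) still tries to decode the payload: it returns a str built from the raw bytes using the declared charset, not the bytes themselves. On the other hand, message.get_payload(decode=True) always yields bytes, although actual transfer decoding happens only when Content-Transfer-Encoding is present and is quoted-printable or base64.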
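This is easy to check on a synthetic single-part message (built by hand here, not the one from the question). It probably also explains where the '\ufffd' in the traceback comes from: the generator takes the str returned by get_payload() and then tries to re-encode it with the declared charset.

```python
import email

# Hand-made 8bit part: GB2312 text followed by a stray 0xff byte.
body = "转发".encode("gb2312") + b"\xff end\n"
raw = (
    b'Content-Type: text/plain; charset="gb2312"\n'
    b"Content-Transfer-Encoding: 8bit\n"
    b"\n" + body
)
msg = email.message_from_bytes(raw)

# decode=False: a str decoded with the declared charset; the stray byte
# shows up as U+FFFD, which gb2312 cannot encode again.
as_text = msg.get_payload(decode=False)

# decode=True: the original bytes, untouched because 8bit needs no
# transfer decoding.
as_raw = msg.get_payload(decode=True)
```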
I ended up with the following code. I am not sure whether this is the proper way to handle email.
body = []
# Text before the first MIME boundary, if any.
if m.preamble is not None:
    body.extend(m.preamble.splitlines(keepends=True))
for part in m.walk():
    # Containers have no body of their own; walk() visits their children.
    if part.is_multipart():
        continue
    ctype = part.get_content_type()
    cte = part.get_params(header='Content-Transfer-Encoding')
    if (ctype is not None and not ctype.startswith('text')) or \
            (cte is not None and cte[0][0].lower() == '8bit'):
        # Non-text and 8bit parts: take the payload as str, no transfer decoding.
        part_body = part.get_payload(decode=False)
    else:
        charset = part.get_content_charset()
        if charset is None or len(charset) == 0:
            charsets = ['ascii', 'utf-8']
        else:
            charsets = [charset]
        part_body = part.get_payload(decode=True)
        for enc in charsets:
            try:
                part_body = part_body.decode(enc)
                break
            except (UnicodeDecodeError, LookupError):
                continue
        else:
            # No candidate charset worked; fall back to the undecoded payload.
            part_body = part.get_payload(decode=False)
    body.extend(part_body.splitlines(keepends=True))
# Text after the last MIME boundary, if any.
if m.epilogue is not None:
    body.extend(m.epilogue.splitlines(keepends=True))