Encoding error: in MIME file data via AWS SES

Question

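I'm trying to retrieve attachment data, such as the file format and filename, from a MIME message received via AWS SES. Unfortunately, the filename encoding is sometimes changed. For example, the filename is "3_amrishmishra_Entry Level Resume - 02.pdf", but in the MIME it appears as '=?UTF-8?Q?amrishmishra=5FEntry_Level_Resume_ =E2=80=93_02=2Epdf?='. Is there any way to get the exact filename?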

if email_message.is_multipart():
    message = ''
    if "apply" in receiver_email.split('@')[0].split('_')[0] and isinstance(int(receiver_email.split('@')[0].split('_')[1]), int):
        for part in email_message.walk():
            content_type = str(part.get_content_type()).lower()
            content_dispo = str(part.get('Content-Disposition')).lower()
            print(content_type, content_dispo)

            if 'text/plain' in content_type and "attachment" not in content_dispo:
                message = part.get_payload()

            if content_type in ['application/pdf', 'text/plain', 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'image/jpeg', 'image/jpg', 'image/png', 'image/gif'] and "attachment" in content_dispo:
                filename = part.get_filename()
                # open('/tmp/local' + filename, 'wb').write(part.get_payload(decode=True))
                # s3r.meta.client.upload_file('/tmp/local' + filename, bucket_to_upload, filename)

                data = {
                    'base64_resume': part.get_payload(),
                    'filename': filename,
                }
                data_list.append(data)
        try:
            api_data = {
                'email_data': email_data,
                'resumes_data': data_list
            }
            print(len(data_list))
            response = requests.post(url, data=json.dumps(api_data),
                                     headers={'content-type': 'application/json'})
            print(response.status_code, response.content)
        except Exception as e:
            print("error %s" % e)

Answer 1

That '=?UTF-8?Q?...?=' syntax is a MIME encoded-word. It is used in MIME email when a header value includes non-ASCII characters (gory details in RFC 2047). Your attachment filename includes an "en dash" character, which is why it's being sent with this encoding.
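To make the format concrete, here is a minimal sketch that takes one encoded-word apart by hand. The helper name decode_q_encoded_word is made up for illustration, and it only handles the "Q" encoding; it is not how you should parse real mail (use the library approaches in the rest of this answer for that).

```python
import quopri

encoded = "=?UTF-8?Q?amrishmishra=5FEntry_Level_Resume_=E2=80=93_02=2Epdf?="


def decode_q_encoded_word(word):
    """Decode a single Q-encoded word by hand (illustration only)."""
    # Layout: "=?" charset "?" encoding "?" encoded-text "?="
    _, charset, encoding, text, _ = word.split("?")
    if encoding.upper() != "Q":
        raise ValueError("this sketch only handles the Q encoding")
    # header=True applies the RFC 2047 rule that "_" stands for a space;
    # "=XX" sequences are hex-encoded bytes (=5F is "_", =2E is ".").
    raw = quopri.decodestring(text.encode("ascii"), header=True)
    return raw.decode(charset)


print(decode_q_encoded_word(encoded))
# amrishmishra_Entry Level Resume – 02.pdf
```

The three bytes =E2=80=93 are the UTF-8 encoding of the en dash, which is the only character that actually forced the sender to use an encoded-word.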

The best way to handle it depends on which version of Python you're using...

Python 3

Python 3's updated email.parser package can correctly decode RFC 2047 headers for you:

# Python 3
from email import message_from_bytes, policy

raw_message_bytes = b"<< the MIME message you downloaded from SES >>"
message = message_from_bytes(raw_message_bytes, policy=policy.default)
for attachment in message.iter_attachments():
    # (EmailMessage.iter_attachments is new in Python 3)
    print(attachment.get_filename())
    # amrishmishra_Entry Level Resume – 02.pdf

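You must specifically ask for policy.default. If you don't, the parser will use the compat32 policy, which replicates Python 2.7's buggy behavior, including not decoding RFC 2047. (Also, early Python 3 releases were still shaking out bugs in the new email package, so make sure you're on Python 3.5 or later.)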
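If you want to see the difference between the two policies for yourself, something like the following should do it. The RAW message here is a tiny hand-written stand-in for what SES would deliver (the addresses, boundary, and payload are invented), and attachment_filenames is just a helper name for this sketch.

```python
from email import message_from_bytes, policy

# Hand-written stand-in for a message downloaded from SES (hypothetical data).
RAW = (
    b"From: sender@example.com\r\n"
    b"To: apply_3@example.com\r\n"
    b"Subject: Resume\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/mixed; boundary="XYZ"\r\n'
    b"\r\n"
    b"--XYZ\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    b"Please find my resume attached.\r\n"
    b"--XYZ\r\n"
    b"Content-Type: application/pdf\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"Content-Disposition: attachment;\r\n"
    b' filename="=?UTF-8?Q?amrishmishra=5FEntry_Level_Resume_=E2=80=93_02=2Epdf?="\r\n'
    b"\r\n"
    b"JVBERi0xLjQK\r\n"
    b"--XYZ--\r\n"
)


def attachment_filenames(raw_bytes, parse_policy):
    """Return the filenames of all parts marked as attachments."""
    msg = message_from_bytes(raw_bytes, policy=parse_policy)
    return [
        part.get_filename()
        for part in msg.walk()
        if part.get_content_disposition() == "attachment"
    ]


# compat32 (the default if you pass no policy) leaves the encoded-word as-is.
print(attachment_filenames(RAW, policy.compat32))
# policy.default decodes the RFC 2047 filename for you.
print(attachment_filenames(RAW, policy.default))
```

walk() is used here instead of iter_attachments() only because the legacy Message class returned under compat32 doesn't have iter_attachments().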

Python 2

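If you're using Python 2, the best option is to upgrade to Python 3.5 or later, if at all possible. Python 2's email parser has many bugs and limitations that were fixed with extensive rewrites in Python 3. (And the rewrite added handy new features like iter_attachments() shown above.)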

If you can't switch to Python 3, you can decode the RFC 2047 filename yourself using email.header.decode_header:

# Python 2 (also works in Python 3, but you shouldn't need it there)
from email.header import decode_header

filename = '=?UTF-8?Q?amrishmishra=5FEntry_Level_Resume_=E2=80=93_02=2Epdf?='
decode_header(filename)
# [('amrishmishra_Entry Level Resume \xe2\x80\x93 02.pdf', 'utf-8')]

(decoded_string, charset) = decode_header(filename)[0]
decoded_string.decode(charset)
# u'amrishmishra_Entry Level Resume – 02.pdf'
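One caveat with the snippet above: decode_header() returns a list of chunks, and taking only element [0] drops anything after the first chunk (for example, when only part of the filename is encoded). A slightly more defensive sketch joins every chunk and falls back to ASCII when a chunk has no charset. The name decode_rfc2047 is my own for this example, and the fallback choices are assumptions you may want to tune.

```python
from email.header import decode_header


def decode_rfc2047(value):
    """Decode every chunk of an RFC 2047 header value and join them."""
    parts = []
    for chunk, charset in decode_header(value):
        # Python 2 always returns byte strings here; Python 3 returns bytes
        # for encoded chunks and may return str for a plain header.
        if isinstance(chunk, bytes):
            chunk = chunk.decode(charset or "ascii", "replace")
        parts.append(chunk)
    return u"".join(parts)


filename = "=?UTF-8?Q?amrishmishra=5FEntry_Level_Resume_=E2=80=93_02=2Epdf?="
print(decode_rfc2047(filename))
```

In the question's loop, that would mean wrapping the existing call, e.g. filename = decode_rfc2047(part.get_filename()), after checking that get_filename() didn't return None.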

But again, if you're trying to parse real-world email in Python 2.7, be aware that this is probably just the first of several problems you'll run into.

The django-anymail package I maintain includes a compatibility version of email.parser.BytesParser that tries to work around several (but not all) other bugs in Python 2.7 email parsing. You may be able to borrow that (internal) code for your purposes. (Or since you tagged your question Django, you might want to look into Anymail's normalized inbound email handling, which includes Amazon SES support.)

python

django

mime

amazon-ses