Spaces replaced by =20 after extracting text from email
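I am trying to get the text of received Gmail messages using the email and imaplib modules in Python. After decoding with UTF-8 and getting the payload of the message, all the spaces are still replaced by =20. Is there another decoding step I can use to fix this?
The code is below (I got it from a YouTube tutorial - https://youtu.be/Jt8LizzxkPU):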
```python
import email
import imaplib

username = "abc"
password = "123"

mail = imaplib.IMAP4_SSL("imap.gmail.com")
mail.login(username, password)
mail.select("inbox")

result, data = mail.uid("search", None, "ALL")
inbox_item_list = data[0].split()

for item in inbox_item_list:
    # most_recent = inbox_item_list[-1]
    # oldest = inbox_item_list[0]
    result2, email_data = mail.uid('fetch', item, '(RFC822)')
    raw_email = email_data[0][1].decode("utf-8")
    email_message = email.message_from_string(raw_email)
    to_ = email_message['To']
    from_ = email_message['From']
    subject_ = email_message['Subject']
    counter = 1
    for part in email_message.walk():
        if part.get_content_maintype() == "multipart":
            continue
        filename = part.get_filename()
        if not filename:
            ext = ".html"
            filename = "msg-part-%08d%s" % (counter, ext)
        counter += 1
        # save file
        content_type = part.get_content_type()
        print(subject_)
        print(content_type)
        if "plain" in content_type:
            print(part.get_payload())
        elif "html" in content_type:
            print("do some beautiful soup")
        else:
            print(content_type)
```
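Here is a complete code example showing how to decode a simple email (one that contains a literal =20 as well as =20 sequences that should be replaced by a space):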
```python
#!/usr/bin/env python3
import email.policy

email_text = """Subject: =?UTF-8?B?dGVzdCDwn5OnID0yMA==?=
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

loooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo=
oooooooooooooooooooooooooooooong=20word
=3D20
^ line starts with =3D20
emoji: <=F0=9F=93=A7>"""
msg = email.message_from_string(
    email_text, policy=email.policy.default
)
print("Subject: <{subject}>".format_map(msg))
assert not msg.is_multipart()
print(msg.get_content())
```
Output:

```
Subject: <test 📧 =20>
loooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooong word
=20
^ line starts with =20
emoji: <📧>
```
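The same policy-based parsing can be applied to raw bytes, such as the `email_data[0][1]` value fetched in the question, without first calling `.decode("utf-8")` on them. The following is only a sketch under assumptions: `subject_and_text` is a hypothetical helper name, and `sample` is made-up data standing in for a real IMAP response.

```python
import email
import email.policy


def subject_and_text(raw: bytes):
    """Parse raw RFC 822 bytes; return (subject, decoded text body)."""
    msg = email.message_from_bytes(raw, policy=email.policy.default)
    # Prefer the text/plain body and fall back to text/html if there is none.
    body = msg.get_body(preferencelist=("plain", "html"))
    text = body.get_content() if body is not None else ""
    return str(msg["Subject"]), text


# Made-up stand-in for email_data[0][1] from the question.
sample = (
    b"Subject: hello\n"
    b'Content-Type: text/plain; charset="UTF-8"\n'
    b"Content-Transfer-Encoding: quoted-printable\n"
    b"\n"
    b"some=20text with =3D20 inside\n"
)

subject, text = subject_and_text(sample)
print(subject)
print(text)
```

Here `get_content()` undoes the quoted-printable transfer encoding and applies the declared charset in one step, so no separate decoding pass is needed.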
`msg.walk()` and `part.get_payload(decode=True)` can be used to traverse more complex `EmailMessage` objects. See the email Examples.
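As an illustration of that traversal, here is a minimal sketch that walks a two-part message and decodes each text part. The `RAW` message and the `decode_text_parts` helper are made up for this example; the sketch also assumes that falling back to UTF-8 is acceptable when a part declares no charset.

```python
import email
import email.policy

# Made-up multipart message with quoted-printable parts.
RAW = b"""\
Subject: sample
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="BOUNDARY"

--BOUNDARY
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

two=20words
--BOUNDARY
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

<p>two=20words</p>
--BOUNDARY--
"""


def decode_text_parts(raw: bytes) -> dict:
    """Map each non-multipart content type to its decoded text."""
    msg = email.message_from_bytes(raw, policy=email.policy.default)
    decoded = {}
    for part in msg.walk():
        if part.get_content_maintype() == "multipart":
            continue  # containers carry no text of their own
        # decode=True undoes the Content-Transfer-Encoding and yields bytes.
        payload = part.get_payload(decode=True)
        charset = part.get_content_charset() or "utf-8"  # assumed fallback
        decoded[part.get_content_type()] = payload.decode(charset, errors="replace")
    return decoded


parts = decode_text_parts(RAW)
print(parts["text/plain"])
print(parts["text/html"])
```

Note that `get_payload(decode=True)` returns bytes, so the charset step is still required to obtain a `str`.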
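Try `import quopri`; then, once you have the content of the email body (or any text that contains `=20` sequences), you can use `quopri.decodestring()`.
This is how I did it: `quopri.decodestring(part.get_payload())`
But keep in mind that this is only appropriate if you explicitly want to decode from quoted-printable. In general, I would say @jfs's answer is cleaner.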
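A small standalone sketch of the `quopri` route, using a made-up sample string and assuming the decoded bytes are UTF-8 (a real message declares its charset in the Content-Type header):

```python
import quopri

# Made-up quoted-printable text: =20 is a space, =3D is a literal "=",
# and =C3=A9 is the UTF-8 encoding of "é".
encoded = b"two=20words =3D20 caf=C3=A9"

decoded_bytes = quopri.decodestring(encoded)  # returns bytes, not str
text = decoded_bytes.decode("utf-8")          # assumed charset
print(text)  # two words =20 café
```

Because `decodestring()` returns bytes, a charset decode is still needed afterwards, and the call should only be applied to parts whose Content-Transfer-Encoding is actually quoted-printable.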