Line breaks in IMAP - =\r\n - how to decode?
I am trying to build an email scraper that goes through certain emails, finds values in them, and stores those values in a CSV file. I have tried many things to solve this, but so far without success.
import imaplib

# imap_url, user and password are defined elsewhere

# Function to get the email content part, i.e. its body
def get_body(msg):
    if msg.is_multipart():
        return get_body(msg.get_payload(decode=True)).decode()
    else:
        return msg.get_payload(decode=True).decode()

# Function to search for a key/value pair
def search(key, value, con):
    result, data = con.search(None, key, '"{}"'.format(value))
    return data

# Function to get the list of emails under this label
def get_emails(result_bytes):
    print("get email")
    msgs = []  # all the email data is collected in this list
    for num in result_bytes[0].split():
        typ, data = con.fetch(num, '(RFC822)')
        msgs.append(data)
    return msgs

# Make an SSL connection to Gmail
con = imaplib.IMAP4_SSL(imap_url)
con.login(user, password)
con.select('Inbox')

msg_ids = get_emails(search('SUBJECT', 'TESTTITELPYTHON', con))

for msg in msg_ids[::-1]:
    for sent in msg:
        if type(sent) is tuple:
            print(msg)
            # encoding set as utf-8
            content = sent[1], 'utf-8'
            data = str(content)
            # Handle errors related to Unicode encoding
            try:
                indexstart = data.find("span")
                data2 = data[indexstart + 5: len(data)]
                indexend = data2.find("</div>")
                # Extract and print the content we need
                # from the email, i.e. its body
                waarde = data2[0: indexend]
                test_naam_1 = waarde.split("Naam: ", 1)[1]
                echte_naam = test_naam_1.split("Email: ", -1)[0]
                email_test = waarde.split("Email: ", 1)[1]
                echte_email = email_test.split("Tel nr.: ", -1)[0]
                tel_test = waarde.split("Tel nr.: ", 1)[1]
                echte_tel = tel_test.split("Onderwerp: ", -1)[0]
                subj_test = waarde.split("Onderwerp: ", 1)[1]
                echte_subj = subj_test.split("Bericht: ", -1)[0]
                print("---ADRESGEGEVENS---")
                print("---Naam: " + echte_naam + "---")
                print("---Email: " + echte_email + "---")
                print("---Tel nr.: " + echte_tel + "---")
                print("---Onderwerp: " + echte_subj + "---")
            except UnicodeEncodeError:
                pass
In my results I still get these ugly line breaks, which look like this in the markup:
[(b'12638 (RFC822 {1973}', b'MIME-Version: 1.0\r\nDate: Mon, 25 Oct 2021 16:41:46 +0200\r\nMessage-ID: <CAJDn=xsVynQqp7BwYoGZB=v21-AAR5=xcMkQ8D2kXE7ZpYFNNQ@mail.example.com>\r\nSubject: TESTTITELPYTHON\r\nFrom: Patrick Merkx <patrick@example.nl>\r\nTo: Patrick Merkx <patrick@example.nl>\r\nContent-Type: multipart/alternative; boundary="00000000000042e6ae05cf2e5c7e"\r\n\r\n--00000000000042e6ae05cf2e5c7e\r\nContent-Type: text/plain; charset="UTF-8"\r\n\r\nContactformulier ingevuld door:\r\nNaam: Patrick Merkx\r\nEmail: merkx.patrick@example.com\r\nTel nr.: 0611381219\r\n\r\nOnderwerp: Nog een test\r\n\r\nBericht:\r\nBericht\r\n\r\n--00000000000042e6ae05cf2e5c7e\r\nContent-Type: text/html; charset="UTF-8"\r\nContent-Transfer-Encoding: quoted-printable\r\n\r\n<div dir=3D"ltr"><div><div dir=3D"ltr" class=3D"gmail_signature" data-smart=\r\nmail=3D"gmail_signature"><div dir=3D"ltr"><div><div dir=3D"ltr"><div><div d=\r\nir=3D"ltr"><div style=3D"font-stretch:normal;font-size:13.33px;line-height:=\r\n19.99px;background:none;border:0px rgb(34,34,34);width:600px;overflow:visib=\r\nle;min-height:0px;outline-width:0px"><span class=3D"gmail-il" style=3D"font=\r\n-size:small">Contactformulier</span><span style=3D"font-size:small">=C2=A0i=\r\nngevuld door:</span><br style=3D"font-size:small"><span style=3D"font-size:=\r\nsmall">Naam: Patrick Merkx</span><br style=3D"font-size:small"><span style=\r\n=3D"font-size:small">Email:=C2=A0</span><a href=3D"mailto:merkx.patrick@gma=\r\nil.com" target=3D"_blank" style=3D"font-size:small">merkx.patrick@example.com=\r\n</a><br style=3D"font-size:small"><span style=3D"font-size:small">Tel nr.: =\r\n0611381219</span><br style=3D"font-size:small"><br style=3D"font-size:small=\r\n"><span style=3D"font-size:small">Onderwerp: Nog een test</span><br style=\r\n=3D"font-size:small"><br style=3D"font-size:small"><span style=3D"font-size=\r\n:small">Bericht:</span><br style=3D"font-size:small"><span 
style=3D"font-si=\r\nze:small">Bericht</span><br></div></div></div></div></div></div></div></div=\r\n></div>\r\n\r\n--00000000000042e6ae05cf2e5c7e--'), b')']
class=3D"gmail-il" style=3D"font=\r\n-size:small">Contactformulier</span><span style=3D"font-size:small">=C2=A0i=\r\nngevuld door:</span><br style=3D"font-size:small"><span style=3D"font-size:=\r\nsmall">Naam: Patrick Merkx</span><br style=3D"font-size:small"><span style=\r\n=3D"font-size:small">Email:=C2=A0</span><a href=3D"mailto:merkx.patrick@gma=\r\nil.com" target=3D"_blank" style=3D"font-size:small">merkx.patrick@gmail.com=\r\n</a><br style=3D"font-size:small"><span style=3D"font-size:small">Tel nr.: =\r\n0611381219</span><br style=3D"font-size:small"><br style=3D"font-size:small=\r\n"><span style=3D"font-size:small">Onderwerp: Nog een test</span><br style=\r\n=3D"font-size:small"><br style=3D"font-size:small"><span style=3D"font-size=\r\n:small">Bericht:</span><br style=3D"font-size:small"><span style=3D"font-si=\r\nze:small">Bericht</span><br>
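I have also tried stripping the body tags, decoding, and several other solutions, but no luck so far. I can't seem to remove these line breaks in any way I know of.
What am I doing wrong?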
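You are looking at a MIME part with Content-Transfer-Encoding: quoted-printable. The =\r\n sequences are quoted-printable soft line breaks, and =3D is an encoded equals sign. The proper way to decode this is to traverse the MIME structure and interpret the parts as you go. There is no need to do that explicitly, though; Python's email library already does it for you.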
from email import message_from_bytes
from email.policy import default
...
msg_ids = get_emails(search('SUBJECT', 'TESTTITELPYTHON', con))
for msg in msg_ids[::-1]:
    for sent in msg:
        if type(sent) is tuple:
            msg = message_from_bytes(sent[1], policy=default)
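Unfortunately, without a sample of the MIME structure of these messages, I can't tell you exactly how to process the resulting message. You probably have something like a "main" MIME body part; msg.get_body(preferencelist=('html', 'plain')) will extract it, and calling get_content() on the result will extract the actual body text.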
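To make the get_body() / get_content() step concrete, here is a minimal sketch. It assumes a message shaped like the one in the question; the RAW bytes below are a shortened, hypothetical stand-in (the boundary string and the `fields` dictionary are illustrative, not part of the original post). Reading the text/plain alternative sidesteps the HTML markup entirely, which may be the easier route for filling a CSV.

```python
from email import message_from_bytes
from email.policy import default

# Hypothetical, shortened stand-in for the fetched RFC822 bytes (sent[1]).
RAW = (
    b'MIME-Version: 1.0\r\n'
    b'Subject: TESTTITELPYTHON\r\n'
    b'Content-Type: multipart/alternative; boundary="BOUNDARY"\r\n'
    b'\r\n'
    b'--BOUNDARY\r\n'
    b'Content-Type: text/plain; charset="UTF-8"\r\n'
    b'\r\n'
    b'Naam: Patrick Merkx\r\n'
    b'Tel nr.: 0611381219\r\n'
    b'\r\n'
    b'--BOUNDARY\r\n'
    b'Content-Type: text/html; charset="UTF-8"\r\n'
    b'Content-Transfer-Encoding: quoted-printable\r\n'
    b'\r\n'
    b'<span style=3D"font-size:=\r\n'
    b'small">Naam:=C2=A0Patrick Merkx</span>\r\n'
    b'--BOUNDARY--\r\n'
)

msg = message_from_bytes(RAW, policy=default)

# get_content() undoes the transfer encoding and the charset in one step.
html = msg.get_body(preferencelist=('html',)).get_content()
plain = msg.get_body(preferencelist=('plain',)).get_content()

# Collect "Key: value" lines from the plain-text alternative.
fields = {}
for line in plain.splitlines():
    key, sep, value = line.partition(':')
    if sep and value.strip():
        fields[key.strip()] = value.strip()

print(fields)
```

The decoded `html` string no longer contains `=3D` or the `=\r\n` soft breaks, so if only the HTML part is available it can be handed to an HTML parser instead of being sliced with `find()` and `split()`.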
The policy=default keyword argument selects the modern email.message.EmailMessage API, which was introduced in Python 3.6, instead of the legacy email.message.Message class from older versions.
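The effect of the policy argument can be checked directly. The following sketch parses the same bytes twice; the RAW message here is a made-up one-line example, used only to show which class each call returns.

```python
from email import message_from_bytes
from email.message import EmailMessage, Message
from email.policy import default

# Made-up minimal message, only used to compare the two APIs.
RAW = b'Subject: demo\r\n\r\nhello\r\n'

legacy = message_from_bytes(RAW)                  # compat32 policy
modern = message_from_bytes(RAW, policy=default)  # modern policy

print(type(legacy).__name__, hasattr(legacy, 'get_body'))
print(type(modern).__name__, hasattr(modern, 'get_body'))
```

Without policy=default the parser returns the legacy class, which has no get_body() or get_content(), so calling them raises AttributeError.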
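In more detail, trying to decode the raw email body as UTF-8 is simply wrong. A typical MIME message has several parts, each of which may have a different encoding, and many of them definitely do not use UTF-8 (though it is becoming more common). Even when a part is UTF-8, the actual UTF-8 bytes usually sit behind a content transfer encoding that protects them from damage during transport over routes that may not be 8-bit clean.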
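The two layers can be peeled off by hand to see the order in which they apply. This is only an illustration using the standard quopri module on a fragment copied from the output in the question; in real code the email library does both steps for you.

```python
import quopri

# Fragment of the quoted-printable HTML part from the question.
encoded = b'<span style=3D"font-si=\r\nze:small">Email:=C2=A0</span>'

# Step 1: undo the content transfer encoding (bytes in, bytes out).
raw_bytes = quopri.decodestring(encoded)

# Step 2: only now decode the charset declared for this part.
text = raw_bytes.decode('utf-8')

print(repr(text))
```

After step 1 the soft line break is gone and `=C2=A0` has become the two bytes of a UTF-8 non-breaking space; step 2 turns those bytes into the single character U+00A0.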