Decode and access an mbox file with Python's mailbox module
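I need to migrate an email database to a CRM, and I have two problems:
I can access the mbox file, but the content is not decoded correctly.
I want to build a dataframe-like structure with the following columns: "Date, From, To, Subject, Body".
I tried the following: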
for i, message in enumerate(mbox):
    print("from :",message['from'])
    print("subject:",message['subject'])
    if message.is_multipart():
        content = (part.get_payload(decode=True) for part in message.get_payload())
    else:
        content = message.get_payload(decode=True)
    print("content:",content)
    print("**************************************")
    if i == 10:
        break
And got the following output:
from : =?UTF-8?Q?Gonzalo_Gasset_Yba=C3=B1ez?= <gonzalo.gasset@baud.es>
subject: =?UTF-8?Q?Marqu=C3=A9s_de_Vargas_=26_Baud?=
content: <generator object <genexpr> at 0x7fe025f3a350>
**************************************
from : Mailtrack Reminder <reminders@mailtrack.io>
subject: Re: Presupuesto de Logotipo y =?utf-8?Q?Dise=C3=B1o?= Corporativo
para nuevo proyecto
content: b'<!DOCTYPE html>\r\n<html>\r\n<head>\r\n <meta charset="utf-8">\r\n <meta name="viewport" content="width=device-width">\r\n <title>Reminder</title>\r\n</head>\r\n<style media="screen">\r\n body {\r\n font-family: Helvetica;\r\n }\r\n</style>\r\n<body style="background-color: #f6f6f6; -webkit-font-smoothing: antialiased; font-size: 14px; line-height: 1.4; margin: 0; padding: 0; .....
The concrete implementations of mailbox.Mailbox accept a factory argument that can be used to build messages. By passing the parse method of a BytesParser initialised with the default policy, we get EmailMessage instances, which automatically decode headers and body text.
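To see the effect of the policy in isolation, here is a minimal sketch that parses the same raw headers twice, once with the legacy compat32 policy and once with the default policy. The raw message is made up for illustration and only reuses the encoded headers from the question's output; the names raw, legacy and modern are not part of any API.

```python
from email.parser import BytesParser
from email.policy import compat32, default

# Illustrative raw message built from the encoded headers shown in the question.
raw = (
    b"From: =?UTF-8?Q?Gonzalo_Gasset_Yba=C3=B1ez?= <gonzalo.gasset@baud.es>\r\n"
    b"Subject: =?UTF-8?Q?Marqu=C3=A9s_de_Vargas_=26_Baud?=\r\n"
    b"\r\n"
    b"body\r\n"
)

# compat32 returns header values as they appear on the wire.
legacy = BytesParser(policy=compat32).parsebytes(raw)
# The default policy decodes RFC 2047 encoded words when a header is read.
modern = BytesParser(policy=default).parsebytes(raw)

print("compat32:", legacy["subject"])
print("default :", modern["subject"])
print("default :", modern["from"])
```

The compat32 result is the same undecoded string the question shows, which suggests the original loop was receiving messages built with that legacy policy.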
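Selecting the actual body is trickier and may depend on your specific requirements. In the code example below, all parts with a "text" main type are joined together and non-text parts are rejected. You may want to apply your own selection criteria.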
from email.parser import BytesParser
from email.policy import default
import mailbox

# path_to_mailbox is the path to your mbox file
mbox = mailbox.mbox(path_to_mailbox, factory=BytesParser(policy=default).parse)

for message in mbox:
    print("date:", message['date'])
    print("to:", message['to'])
    print("from:", message['from'])
    print("subject:", message['subject'])
    if message.is_multipart():
        contents = []
        for part in message.walk():
            # Reject containers and non-text types
            if part.get_content_maintype() != 'text':
                continue
            contents.append(part.get_content())
        content = '\n\n'.join(contents)
    else:
        content = message.get_content()
    print("content:", content)
    print("**************************************")
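For the second part of the question, the same loop can collect rows instead of printing them. The following is a sketch only: text_body and mbox_to_rows are hypothetical helper names, the column names are taken from the question, and the body selection simply reuses the "join every text part" rule from above.

```python
import mailbox
from email.parser import BytesParser
from email.policy import default


def text_body(message):
    """Join every text/* part of a message, skipping containers and non-text parts."""
    parts = [
        part.get_content()
        for part in message.walk()
        if part.get_content_maintype() == "text"
    ]
    return "\n\n".join(parts)


def mbox_to_rows(path):
    """Return one dict per message with the five columns from the question."""
    # create=False avoids silently creating an empty mailbox if the path is wrong.
    box = mailbox.mbox(path, factory=BytesParser(policy=default).parse, create=False)
    try:
        return [
            {
                # A missing header comes back as None, so fall back to "".
                "Date": str(msg["date"] or ""),
                "From": str(msg["from"] or ""),
                "To": str(msg["to"] or ""),
                "Subject": str(msg["subject"] or ""),
                "Body": text_body(msg),
            }
            for msg in box
        ]
    finally:
        box.close()
```

If pandas is installed, pandas.DataFrame(mbox_to_rows(path_to_mailbox)) should turn the list of dicts into a dataframe with those five columns. If you would rather keep a single preferred body than every text part, EmailMessage.get_body(preferencelist=('plain', 'html')) is one alternative selection rule to try.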