Access all fields in mbox using mailbox
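I am trying to do some processing on emails in mbox format.
After some searching and a bit of trial and error with https://docs.python.org/3/library/mailbox.html#mbox
I have managed most of what I want using the test code listed below (even though I had to write my own code to decode the Subject).
I found this rather hit and miss. In particular, finding the key needed to look up a field, such as 'subject', was trial and error, and I cannot find any way to list the candidate keys. (I understand that the fields may vary from email to email.)
Can anyone help me list the possible values?
I also have another question: an email may contain several "Received:" fields, for example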
Received: from awcp066.server-cpanel.com
Received: from mail116-213.us2.msgfocus.com ([185.187.116.213]:60917)
    by awcp066.server-cpanel.com with esmtps (TLSv1.2:ECDHE-RSA-AES256-GCM-SHA384:256)
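I am interested in accessing the chronologically first one. I am happy to search, but I cannot seem to find any way to access even the first one in the file.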
#! /usr/bin/env python3
#import locale
#2020-08-31
"""
Extract Subject from MBOX file
"""
import os, time
import mailbox
import base64, quopri

def isbqencoded(s):
    """
    Test if Base64 or Quoted Printable strings
    """
    return s.upper().startswith('=?UTF-8?')

def bqdecode(s):
    """
    Convert UTF-8 Base64 or Quoted Printable string to str
    """
    nd = s.find('?=', 10)
    if s.upper().startswith('=?UTF-8?B?'):   # Base64
        bbb = base64.b64decode(s[10:nd])
    elif s.upper().startswith('=?UTF-8?Q?'): # Quoted Printable
        bbb = quopri.decodestring(s[10:nd])
    return bbb.decode("utf-8")

def sdecode(s):
    """
    Convert possibly multiline Base64 or Quoted Printable strings to str
    """
    outstr = ""
    if s is None:
        return outstr
    for ss in str(s).splitlines():   # split multiline strings
        sss = ss.strip()
        for sssp in sss.split(' '):   # split multiple strings
            if isbqencoded(sssp):
                outstr += bqdecode(sssp)
            else:
                outstr += sssp
            outstr+=' '
    outstr = outstr.strip()
    return outstr

INBOX = '~/temp/2020227_mbox'
print('Messages in ', INBOX)
mymail = mailbox.mbox(INBOX)
print('Values = ', mymail.values())
print('Keys = ', mymail.keys())
# print(mymail.items)
# for message in mailbox.mbox(INBOX):
for message in mymail:
    # print(message)
    subject = message['subject']
    to = message['to']
    id = message['id']
    received = message['Received']
    sender = message['from']
    ddate = message['Delivery-date']
    envelope = message['Envelope-to']
    print(sdecode(subject))
    print('To ', to)
    print('Envelope ', envelope)
    print('Received ', received)
    print('Sender ', sender)
    print('Delivery-date ', ddate)
    # print('Received ', received[1])
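Following a comment, I simplified the Subject decoding and got similar results.
I am still looking for suggestions on accessing the rest of the header, in particular how to access the multiple "Received:" fields.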
#! /usr/bin/env python3
#import locale
#2020-09-02
"""
Extract Subject from MBOX file
"""
import os, time
import mailbox
from email.parser import BytesParser
from email.policy import default

INBOX = '~/temp/2020227_mbox'
print('Messages in ', INBOX)
# The factory makes mbox return EmailMessage objects parsed with the default policy
mymail = mailbox.mbox(INBOX, factory=BytesParser(policy=default).parse)
for message in mymail:
    print("date:   :", message['date'])
    print("to:     :", message['to'])
    print("from    :", message['from'])
    print("subject:", message['subject'])
    print('Received: ', message['received'])
    print("**************************************")
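As a side note on why the hand-written decoder is no longer needed: with `policy=default`, the parser returns header values with RFC 2047 encoded words already decoded. The following is a minimal sketch, assuming a made-up in-memory message (the addresses and the Subject text are hypothetical) rather than a real mbox file.

```python
from email.parser import BytesParser
from email.policy import default

# Hypothetical sample message; the Subject is a Q-encoded UTF-8 encoded word.
raw = (
    b"From: sender@example.com\r\n"
    b"Subject: =?UTF-8?Q?Caf=C3=A9_menu?=\r\n"
    b"\r\n"
    b"body\r\n"
)

# Same parser and policy as the factory passed to mailbox.mbox above.
message = BytesParser(policy=default).parsebytes(raw)

# With policy=default the header value comes back already decoded.
subject = message['subject']
print(subject)
```

The same message parsed without `policy=default` would return the raw `=?UTF-8?Q?...?=` text, which is what the first script had to decode by hand.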
Email message objects provide a get_all method that returns all instances of a header, so we can use it to get all the values of the Received header.
for header in message.get_all('received'):
    print('Received', header)
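On the original question of listing the available keys: a message object also has `keys()` and `items()` methods, which return the header names (and values) actually present in that message. Here is a small sketch, assuming a hypothetical two-hop message built in memory so that it runs without an mbox file; the host names are invented.

```python
from email.parser import BytesParser
from email.policy import default

# Hypothetical message with two Received headers (newest first, as in real mail).
raw = (
    b"Received: from relay.example.net by mx.example.org; Mon, 31 Aug 2020 10:00:05 +0000\r\n"
    b"Received: from origin.example.com by relay.example.net; Mon, 31 Aug 2020 10:00:01 +0000\r\n"
    b"From: sender@example.com\r\n"
    b"To: rcpt@example.org\r\n"
    b"Subject: test\r\n"
    b"\r\n"
    b"body\r\n"
)
message = BytesParser(policy=default).parsebytes(raw)

# keys() lists every header name in the message, duplicates included.
header_names = message.keys()
print(header_names)

# items() pairs each name with its value, in message order.
for name, value in message.items():
    print(f"{name}: {value}")

# get_all() returns every value for one name; the second argument is
# the fallback returned when the header is absent.
received = message.get_all('received', [])
print(len(received))
```

Passing `[]` as the fallback avoids a `TypeError` when looping over a message that has no Received header at all, since `get_all` otherwise returns `None` in that case.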
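Each header is an instance of UnstructuredHeader. That is not very helpful for identifying the earliest Received header, because each header would have to be parsed to extract its date before they could be sorted.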
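If sorting by date is still wanted, one possible approach is sketched below. It assumes that each Received value ends with `; <date-time>`, which is the usual layout but is not guaranteed for every server; `received_timestamp` is a hypothetical helper name, and the sample values are invented.

```python
from email.utils import parsedate_to_datetime

def received_timestamp(header_value):
    """Return the datetime after the final ';' of a Received value, or None."""
    _, sep, date_text = str(header_value).rpartition(';')
    if not sep:
        return None  # no ';' at all, so no date part to parse
    try:
        return parsedate_to_datetime(date_text.strip())
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return None

# Hypothetical Received values, newest first.
sample_values = [
    "from relay.example.net by mx.example.org; Mon, 31 Aug 2020 10:00:05 +0000",
    "from origin.example.com by relay.example.net; Mon, 31 Aug 2020 10:00:01 +0000",
]

# Keep only the values whose date could be parsed, then take the minimum.
dated = [(received_timestamp(v), v) for v in sample_values]
dated = [pair for pair in dated if pair[0] is not None]
earliest = min(dated, key=lambda pair: pair[0])[1]
print(earliest)
```

One caveat: a date with a `-0000` offset is returned as a naive datetime, and comparing naive with timezone-aware datetimes raises `TypeError`, so real mailboxes may need extra normalisation before sorting.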
However, according to this answer, which quotes the RFC, Received headers are always inserted at the beginning of the message. The docstring for EmailMessage.get_all()
says:
Return a list of all the values for the named field.
These will be sorted in the order they appeared in the original
message, and may contain duplicates.
So the earliest Received header should be the last header in the list returned by EmailMessage.get_all().
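Put together, that reasoning can be sketched as a small helper. `earliest_received` is a hypothetical name, and the sample message is invented; the sketch relies on every hop having prepended its header as described above, which a forged or rewritten message may not respect.

```python
from email.parser import BytesParser
from email.policy import default

def earliest_received(message):
    """Return the last Received value in message order, or None if absent."""
    values = message.get_all('received', [])
    return values[-1] if values else None

# Hypothetical message: each hop prepends its header, so the oldest is last.
raw = (
    b"Received: from relay.example.net by mx.example.org; Mon, 31 Aug 2020 10:00:05 +0000\r\n"
    b"Received: from origin.example.com by relay.example.net; Mon, 31 Aug 2020 10:00:01 +0000\r\n"
    b"Subject: test\r\n"
    b"\r\n"
    b"body\r\n"
)
message = BytesParser(policy=default).parsebytes(raw)
first_hop = earliest_received(message)
print(first_hop)

# A message with no Received header yields None instead of an error.
bare = BytesParser(policy=default).parsebytes(b"Subject: x\r\n\r\nbody\r\n")
print(earliest_received(bare))
```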
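Following snakecharmerb's comment (now edited into the question), I simplified the process.
In the end I did not need to decode Received, because the Message-ID actually contains the id extracted from the original Received field.
I am listing the code I ended up using in case it is useful to anyone else.
This code only extracts the header fields of interest and prints them; the full code goes on to analyse the messages.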
#! /usr/bin/env python3
#import locale
#2020-09-05
"""
Extract Message Header details from MBOX file
"""
import os, time
import mailbox
from email.parser import BytesParser
from email.policy import default

INBOX = '~/temp/Gmail'
print('Messages in ', INBOX)
mymail = mailbox.mbox(INBOX, factory=BytesParser(policy=default).parse)
for message in mymail:
    date = message['date']
    to = message['to']
    sender = message['from']
    subject = message['subject']
    messageID = message['Message-ID']
    received = message['received']
    deliveredTo = message['Delivered-To']
    # Skip entries that have no Message-ID header
    if messageID is None:
        continue
    print("Date        :", date)
    print("From        :", sender)
    print("To:         :", to)
    print('Delivered-To:', deliveredTo)
    print("Subject     :", subject)
    print("Message-ID  :", messageID)
    # print('Received    :', received)
    print("**************************************")