Write HTML files from an mbox file
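Before Yahoo Groups closed, you could download a group's content as an mbox file. I am trying to convert the mbox file into a series of HTML files, one per message. My problem is handling encodings and special characters in the HTML. Here is my attempt: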
import mailbox

the_dir = "/path/to/file"
mbox = mailbox.mbox(the_dir + "12394334.mbox")
html_header = """<!DOCTYPE html>
<html>
<head>
<title>Email message</title>
</head>
<body>"""
html_footer = '</body></html>'

for message in mbox:
    mess_from = message['from']
    subject = message['subject']
    time_received = message['date']
    if message.is_multipart():
        content = ''.join(str(part.get_payload(decode=True)) for part in message.get_payload())
    else:
        content = message.get_payload(decode=True)
    content = str(content)[2:].replace('\n', '<br/>')
    subject.replace('/', '-')
    fname = subject + " " + time_received + '.html'
    with open(the_dir + 'html/' + fname , 'w') as the_file:
        the_file.write(html_header)
        the_file.write('<br/>' + 'From: ' + mess_from)
        the_file.write('<br/>' + 'Subject: ' + subject)
        the_file.write('<br/>' + 'Received: ' + time_received + '<br/><br/>')
        the_file.write(content)
The message content ends up with backslashes before apostrophes and other special characters, for example:
star rating, currently going for \xa311.99 [ideal Xmas present].
Advert over - Seroiusly, if you don't have a decent book on small boat
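My question is: what is the best way to get the email content and write it to an HTML file with the correct characters? I can't be the first person to run into this problem.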
I found an answer to this question.
First, I need to identify HTML parts by their subtype (part.get_content_subtype()). That is how I know I have an HTML part.
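To illustrate the subtype check, here is a minimal sketch that builds a small sample message (the sample data and the helper name list_leaf_types are made up for illustration) and lists the type of every non-container part. Note that a plain-text part reports the subtype 'plain', not 'text'.

```python
from email.message import EmailMessage

# Hypothetical sample: a multipart/alternative message with two bodies.
msg = EmailMessage()
msg["Subject"] = "Sample"
msg.set_content("plain body")
msg.add_alternative("<p>html body</p>", subtype="html")


def list_leaf_types(message):
    """Return (maintype, subtype) for every non-container part."""
    leaves = []
    for part in message.walk():
        if part.get_content_maintype() in ("multipart", "message"):
            continue  # containers have no body of their own
        leaves.append((part.get_content_maintype(), part.get_content_subtype()))
    return leaves


print(list_leaf_types(msg))  # [('text', 'plain'), ('text', 'html')]
```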
Then I need the character set, which I get from part.get_charsets(). There is also a part.get_charset(), but it always returns None for these messages, so I take the first element of get_charsets().
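A quick way to see the difference between the charset methods is to parse a single hand-written part. This is only a sketch on made-up sample bytes; get_content_charset() is a further standard-library method that reads the charset parameter of the Content-Type header directly.

```python
import email

# Hypothetical sample part, declared as ISO-8859-1 and quoted-printable.
raw = (
    b'Content-Type: text/html; charset="iso-8859-1"\n'
    b"Content-Transfer-Encoding: quoted-printable\n"
    b"\n"
    b"<p>going for =A311.99</p>\n"
)
part = email.message_from_bytes(raw)

print(part.get_charset())          # None: no Charset object is attached on parsing
print(part.get_charsets())         # ['iso-8859-1']
print(part.get_content_charset())  # iso-8859-1
```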
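get_payload is a little counterintuitive: decode=True only undoes the Content-Transfer-Encoding (base64 or quoted-printable) and returns bytes; it does not turn those bytes into text. I therefore decode the bytes myself with the charset obtained earlier. If no charset is declared, I fall back to decode=False, which returns the payload as a string with the transfer encoding left in place.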
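The following sketch shows the two steps on a made-up sample part: get_payload(decode=True) yields bytes, and decoding those bytes with the declared charset yields the text. It also shows where backslash sequences such as \xa3 come from when str() is called on the bytes. The UTF-8 fallback is an assumption, not something the email package prescribes.

```python
import email

# Hypothetical sample part, declared as ISO-8859-1 and quoted-printable.
raw = (
    b'Content-Type: text/html; charset="iso-8859-1"\n'
    b"Content-Transfer-Encoding: quoted-printable\n"
    b"\n"
    b"<p>going for =A311.99</p>\n"
)
part = email.message_from_bytes(raw)

undecoded = part.get_payload(decode=False)  # str, still contains "=A3"
as_bytes = part.get_payload(decode=True)    # bytes, transfer encoding removed

# Assumption: fall back to UTF-8 when the part declares no charset.
charset = part.get_content_charset() or "utf-8"
text = as_bytes.decode(charset, errors="replace")

print(str(as_bytes))  # the repr of the bytes, with a literal \xa3 in it
print(text)           # the decoded text, with a real pound sign
```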
If the message is plain text, I strip the newlines etc., add an HTML header, and then write it to the file.
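For the plain-text case, one possible refinement is to escape the text before inserting the line breaks, so that characters such as < and & in the message are not read as markup. This is a sketch only; text_to_html_page is a hypothetical helper, not part of the code in this answer.

```python
import html


def text_to_html_page(body, title):
    """Wrap a plain-text body in a minimal HTML page (hypothetical helper)."""
    # Escape &, < and > so text such as "<fred@example.com>" is not read as a tag
    escaped = html.escape(body, quote=False)
    # Normalise line endings, then turn each newline into a visible break
    escaped = escaped.replace("\r\n", "\n").replace("\n", "<br/>\n")
    return (
        "<!DOCTYPE html>\n<html>\n<head>\n"
        '<meta charset="utf-8">\n'
        f"<title>{html.escape(title)}</title>\n"
        "</head>\n<body>\n"
        f"{escaped}\n"
        "</body></html>"
    )


page = text_to_html_page("From <fred@example.com>\r\nFish & chips", "Sample")
print(page)
```

The meta tag assumes the file is then written with encoding="utf-8".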
Next jobs:
- Use BeautifulSoup to add the sender info/subject to the text
- Figure out how to handle attachments and link them to the HTML file
- Some characters still do not display, e.g. "£"
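For the attachments item, a possible starting point using only the standard library is sketched below. It treats any part that carries a filename as an attachment, which is an assumption (inline images may also have filenames), and collect_attachments is a hypothetical helper name.

```python
from email.message import EmailMessage


def collect_attachments(message):
    """Return (filename, bytes) pairs for every part that carries a filename."""
    found = []
    for part in message.walk():
        if part.get_content_maintype() == "multipart":
            continue  # containers have no payload of their own
        filename = part.get_filename()
        if filename:
            # decode=True removes base64/quoted-printable and returns raw bytes
            found.append((filename, part.get_payload(decode=True)))
    return found


# Hypothetical sample message with one binary attachment.
msg = EmailMessage()
msg.set_content("see attached")
msg.add_attachment(b"\x89PNG fake bytes", maintype="image",
                   subtype="png", filename="chart.png")

for name, data in collect_attachments(msg):
    print(name, len(data))
```

Each pair could then be written next to the HTML file in binary mode and referenced from it with an ordinary link.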
import mailbox

the_dir = "/path/to/mbox/"
mbox = mailbox.mbox(the_dir + "12394334.mbox")
html_footer = "</body></html>"

for message in mbox:
    mess_from = message['from']
    subject = message['subject']
    time_received = message['date']
    fname = subject + " " + time_received
    fname = fname.replace('/', '-')
    html_flag = False  # reset for every message
    if message.is_multipart():
        contents_text = []
        contents_html = []
        for part in message.walk():
            maintype = part.get_content_maintype()
            subtype = part.get_content_subtype()
            if maintype == 'multipart' or maintype == 'message':
                # Reject containers
                continue
            if subtype == 'html':
                enc = part.get_charsets()
                if enc[0] is not None:
                    # decode=True returns bytes; the charset turns them into text
                    contents_html.append(part.get_payload(decode=True).decode(enc[0]))
                else:
                    contents_html.append(part.get_payload(decode=False))
            elif subtype == 'plain':  # text/plain has subtype 'plain', not 'text'
                contents_text.append(part.get_payload(decode=False))
            else:  # I will use this to process attachments in the future
                continue
        if len(contents_html) > 0:
            if len(contents_html) > 1:
                print('multiple html')  # This hasn't happened yet
            html_flag = True
            content = '\n\n'.join(contents_html)
        else:
            # No HTML part: fall back to the plain-text parts
            content = '\n\n'.join(contents_text)
    else:
        content = message.get_payload(decode=False)
    if not html_flag:
        # Plain text: handle quoted-printable leftovers first, then line breaks
        content = content.replace('=\n', '<br/>')
        content = content.replace('\n', '<br/>')
        content = content.replace('=20', '')
        html_header = f"""<!DOCTYPE html>
<html>
<head>
<title>{fname}</title>
</head>
<body>"""
        content = (html_header + '<br/>' +
                   'From: ' + mess_from + '<br/>' +
                   'Subject: ' + subject + '<br/>' +
                   'Received: ' + time_received + '<br/><br/>' +
                   content + html_footer)
    with open(the_dir + "html/" + fname + ".html", "w", encoding="utf-8") as the_file:
        the_file.write(content)
print('Done!')
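One thing the code above does not cover is subjects that are missing, MIME-encoded, or contain characters that are awkward in file names. A sketch of one way to handle that follows; safe_filename, its placeholder strings, and the max_len default are hypothetical choices, while decode_header and make_header are the standard-library functions for decoding encoded header words.

```python
import re
from email.header import decode_header, make_header


def safe_filename(subject, date, max_len=100):
    """Build a file name from subject and date (hypothetical helper)."""
    # Decode MIME encoded-words such as =?iso-8859-1?q?...?= into real text
    subject = str(make_header(decode_header(subject or "no subject")))
    name = f"{subject} {date or 'no date'}"
    # Replace characters that are invalid or awkward in file names
    name = re.sub(r'[\\/:*?"<>|\r\n]+', "-", name)
    return name.strip()[:max_len]


fname = safe_filename("=?iso-8859-1?q?Price_=A311.99?=",
                      "Mon, 1 Jan 2018 10:00:00 +0000")
```

Truncating to max_len characters is a simple guard against over-long names; it does not guarantee uniqueness if two messages share a subject and date.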