Parse multipart/related emails
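I am trying to parse emails and convert the tables they contain into pandas DataFrames.
Since some of the emails are multipart, I borrowed some existing code that walks the message parts.
The following code works fine for most messages, but it breaks on multipart/related emails (no table is found).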
import email
import imaplib

import pandas as pd
from bs4 import BeautifulSoup

HOST = 'imap.gmail.com'
m = imaplib.IMAP4_SSL(HOST, 993)
m.login(USERNAME, PASSWORD)
m.select('Inbox')
result, data = m.uid('search', None, "UNSEEN", '(FROM "xxx@xxx.xxx")')
print(result)
if result == 'OK':
    for num in data[0].split()[:]:
        result, data = m.uid('fetch', num, '(RFC822)')
        if result == 'OK':
            email_message = email.message_from_bytes(data[0][1])
            b = email_message
            body = ""
            print(b.is_multipart())
            if b.is_multipart():
                for part in b.walk():
                    ctype = part.get_content_type()
                    cdispo = str(part.get('Content-Disposition'))
                    # skip any text/plain (txt) attachments
                    if ctype == 'text/plain' and 'attachment' not in cdispo:
                        body = part.get_payload(decode=True)  # decode
                        break
            else:
                body = b.get_payload(decode=True)
            soup = BeautifulSoup(body)
            table = soup.find_all('table')
            df = pd.read_html(str(table))[0]
            display(df)
Here are the headers of one of the multipart/related emails:
Delivered-To: xxxxxx@gmail.com
Received: by 2002:a05:6a10:cc86:0:0:0:0 with SMTP id gj6csp6140432pxb;
Mon, 27 Dec 2021 14:52:14 -0800 (PST)
X-Google-Smtp-Source: ABdhPJxPtKdKdVFNfgIE5xJdGrqDvekcD9MVkXdJaQyjJcVjc63N0KmOSN1LKvqLDbzssUU+6xjG
X-Received: by 2002:a05:620a:1132:: with SMTP id p18mr13912209qkk.778.1640645534051;
Mon, 27 Dec 2021 14:52:14 -0800 (PST)
ARC-Seal: i=1; a=rsa-sha256; t=1640645534; cv=none;
d=google.com; s=arc-20160816;
b=JUwqNu9ZFFy3j5ke7GddEIhpUGSdzB0gby+k5PFr3AwQv+/JtDY6p9ksOhReeFkQpd
2rNOhn9HknPnVpu1s+S9BT+YIrKWo8jrCzqJRWkaiY7MN80BGjw+oSkoD+WTNoo9rk7t
ojil3vIatY02Unl5FfYlOUxZbFZ7Xb3xT44Zd9lRI7aQNrLZxSjeQAF/oL+N8eE0rMXo
T5McU5R165sEb81twUpHrSkbp34/v31W25kOwx68Mb7hkuOTv/komZiQy1oiP+xzUKDH
CxKOgF/UgzVD5mhyB6DSSEN22DQ4ybrmshmd+B5wugSVlY9hfw0t89kJQGChKUphk9GH
/VWw==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816;
h=feedback-id:mime-version:date:message-id:subject:to:from
:dkim-signature:dkim-signature;
bh=iqw+mlksCZlkG8lxD5rVcYUL5uh/jJYU8nLc+GpCr/4=;
b=qnu0Xb2/dj8zwtelmnry7/okDbUj4QpsNPtWtovwrbtlDIpnSS8HRq4qzVzUy6TDFE
flm0XO489XNMO/GJ8Jw0J5Duujhnto3PiBRrAtIcA4CXkKhRe3SpXYk7D+PjROg+Zngk
5lqA9RgxerLMq+wMRD4WlcZVuWmmUtBhY/T9XbXOXUlJJJa9qn6AlKNOp5ZV8CDxweTp
yCDuQpJSCrbp1mldDe3N6lQAUXfaoGIBu6Kv7hpdZHwdrNMIeuhyCHTI4JF1IV0lK+G0
DzJg76RxnRQ3q0eacW9X/hzbMLZeljxfUO18BeDzRp45i3XqVyVsC53TirpmYv7OcB50
MaWA==
ARC-Authentication-Results: i=1; mx.google.com;
dkim=pass header.i=@xxxxxx.com header.s=xdzpvx2vm2fr73bjeppds7oqr3jbfy5s header.b=BATglTQY;
dkim=pass header.i=@amazonses.com header.s=ug7nbtf4gccmlpwj322ax3p6ow6yfsug header.b=YLjw7lGE;
spf=pass (google.com: domain of 0100017dfe181d85-07cbd269-a94a-4b30-8d52-1e4e48a34639-000000@amazonses.xxxxxx.com designates 54.240.11.40 as permitted sender) smtp.mailfrom=0100017dfe181d85-07cbd269-a94a-4b30-8d52-1e4e48a34639-000000@amazonses.xxxxxx.com;
dmarc=pass (p=REJECT sp=NONE dis=NONE) header.from=xxxxxx.com
Return-Path: <0100017dfe181d85-07cbd269-a94a-4b30-8d52-1e4e48a34639-000000@amazonses.xxxxxx.com>
Received: from a11-40.smtp-out.amazonses.com (a11-40.smtp-out.amazonses.com. [54.240.11.40])
by mx.google.com with ESMTPS id g19si4414275qtm.154.2021.12.27.14.52.13
for <xxxxxx@gmail.com>
(version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128);
Mon, 27 Dec 2021 14:52:14 -0800 (PST)
Received-SPF: pass (google.com: domain of 0100017dfe181d85-07cbd269-a94a-4b30-8d52-1e4e48a34639-000000@amazonses.xxxxxx.com designates 54.240.11.40 as permitted sender) client-ip=54.240.11.40;
Authentication-Results: mx.google.com;
dkim=pass header.i=@xxxxxx.com header.s=xdzpvx2vm2fr73bjeppds7oqr3jbfy5s header.b=BATglTQY;
dkim=pass header.i=@amazonses.com header.s=ug7nbtf4gccmlpwj322ax3p6ow6yfsug header.b=YLjw7lGE;
spf=pass (google.com: domain of 0100017dfe181d85-07cbd269-a94a-4b30-8d52-1e4e48a34639-000000@amazonses.xxxxxx.com designates 54.240.11.40 as permitted sender) smtp.mailfrom=0100017dfe181d85-07cbd269-a94a-4b30-8d52-1e4e48a34639-000000@amazonses.xxxxxx.com;
dmarc=pass (p=REJECT sp=NONE dis=NONE) header.from=xxxxxx.com
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/simple; s=xdzpvx2vm2fr73bjeppds7oqr3jbfy5s; d=xxxxxx.com; t=1640645533; h=Content-Type:From:To:Subject:Message-ID:Date:MIME-Version; bh=y7l8Len/FG0elemUfgWg28W0SEj5eOJRIMBt9xIFrQo=; b=BATglTQY6PkcRChCgrX9BMdkZVwppc3CCPZ2QliEN6VGtr4YxW7l0C1n3mMgeRCL 0fXjKZwX3enRf9cHfKFJQErkxlmUfyKkLbtKJ4xNd78r4D04aCgUBRgovY05e2lE2vq KZEiJhF7oUN+QyxE87GahoQ88S/7cVjVVIh0RSHQ=
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/simple; s=ug7nbtf4gccmlpwj322ax3p6ow6yfsug; d=amazonses.com; t=1640645533; h=Content-Type:From:To:Subject:Message-ID:Date:MIME-Version:Feedback-ID; bh=y7l8Len/FG0elemUfgWg28W0SEj5eOJRIMBt9xIFrQo=; b=YLjw7lGEYZH+SQ4mx1EEdMVAo2v0EzbKGyGHmzH1CkvlnMv9yjMn4x3/BYhpOTxm yZ532qDZBGIIUPkCjoKOAz6K6a11xzPBREIl8Bz0O0kJyEcoShGahRbY4bgNCkOocx8 IJD+NREMTfVK6wlsxzoWRS+HAnVfg1pU80yORo7M=
Content-Type: multipart/related; type="text/html"; boundary="--_NmP-f890ebfb5c0d8a34-Part_1"
From: xxxxxx <noreply@xxxxxx.com>
To: xxxxxx@gmail.com
Subject: Watchlist Summary for Mon, December 27, 2021 (Futures)
Message-ID: <0100017dfe181d85-07cbd269-a94a-4b30-8d52-1e4e48a34639-000000@email.amazonses.com>
Date: Mon, 27 Dec 2021 22:52:13 +0000
MIME-Version: 1.0
Feedback-ID: 1.us-east-1.xy6STr9N8VtfY9IEmltVU/dtudHWlVMH37XgJn5/ROY=:AmazonSES
X-SES-Outgoing: 2021.12.27-54.240.11.40
----_NmP-f890ebfb5c0d8a34-Part_1
Content-Type: text/html; charset=utf-8
Content-Transfer-Encoding: quoted-printable
<!DOCTYPE html><html lang=3D"en"><head><meta charset=3D"UTF-8"><meta http-e=
quiv=3D"Content-Type" content=3D"text/html; charset=3DUTF-8"><meta http-equ=
iv=3D"X-UA-Compatible" content=3D"IE=3Dedge"><meta name=3D"viewport" conten=
t=3D"width=3Ddevice-width, initial-scale=3D1.0"><!-- So that mobile webkit =
will display zoomed in--><meta name=3D"format-detection" content=3D"telepho=
ne=3Dno"><!-- disable auto telephone linking in iOS--><title></title><style=
type=3D"text/css">}
.ad p {
margin-top: 4px;
}
</style><style type=3D"text/css">#data-table,
.data-table { max-width:100%; min-width:100%; width:100%; border-collapse:c=
ollapse; }
#data-table th,
#data-table td,
.data-table th,
.data-table td { color:#000000; border-collapse:collapse; padding:4px; whit=
e-space:nowrap; border:1px solid #D8D8D8; }
#data-table .body tr:nth-of-type(odd),
.data-table .body tr:nth-of-type(odd) { background-color:#f3f3f3; }
#data-table table tbody .spacer td,
.data-table table tbody .spacer td { border:none; }
.preHeaderHide { display:none !important; mso-hide:all !important; }
/* Outlook link fix */
#outlook a { padding:0; }
/* Resets: see reset.css for details */
.ReadMsgBody { width:100%; background-color:#ebebeb; }
/* Hotmail background and line height fixes */
.ExternalClass { width:100%; background-color:#ebebeb; }
.ExternalClass, .ExternalClass p, .ExternalClass span, .ExternalClass font,=
.ExternalClass td, .ExternalClass div { line-height:100%; }
Any ideas?
Thanks
You want to parse the text/html part.
You should check for content type == 'text/html' instead of 'text/plain'.
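A minimal sketch of that suggestion, using only the standard library. The helper name extract_html and the sample message are made up for illustration; the sample only imitates the structure shown in the headers above (a multipart/related container with a single text/html part) and is not the real email.

```python
from email import message_from_bytes
from email.message import Message
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def extract_html(msg: Message) -> str:
    """Return the first non-attachment text/html part as a string, or ''."""
    # walk() yields the message itself first and then every sub-part,
    # so the same loop also covers non-multipart messages.
    for part in msg.walk():
        if part.get_content_type() != 'text/html':
            continue
        if 'attachment' in str(part.get('Content-Disposition')):
            continue
        # decode=True undoes quoted-printable/base64 and returns bytes.
        payload = part.get_payload(decode=True)
        charset = part.get_content_charset() or 'utf-8'
        return payload.decode(charset, errors='replace')
    return ''


# Hypothetical stand-in for a fetched message (made-up table content).
sample = MIMEMultipart('related', type='text/html')
sample.attach(MIMEText(
    '<html><body><table><tr><th>Symbol</th></tr>'
    '<tr><td>ABC</td></tr></table></body></html>',
    'html', 'utf-8'))
msg = message_from_bytes(sample.as_bytes())

html = extract_html(msg)
print('<table>' in html)
```

In the question's loop this would roughly become body = extract_html(email_message), after which the string can be handed to BeautifulSoup or pd.read_html. Recent pandas versions prefer a file-like object, so wrapping the string in io.StringIO is probably the safer call.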