如何用 python 解析电子邮件 header?

How can I parse an email header with python?

这是一个示例电子邮件 header,

header = """
From: Media Temple user (mt.kb.user@gmail.com)
Subject: article: A sample header
Date: January 25, 2011 3:30:58 PM PDT
To: user@example.com
Return-Path: <mt.kb.user@gmail.com>
Envelope-To: user@example.com
Delivery-Date: Tue, 25 Jan 2011 15:31:01 -0700
Received: from :po-out-1718.google.com ([72.14.252.155]:54907) by cl35.gs01.gridserver.com with esmtp (Exim 4.63) (envelope-from <mt.kb.user@gmail.com>) id 1KDoNH-0000f0-RL for user@example.com; Tue, 25 Jan 2011 15:31:01 -0700
Dkim-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from:to :subject:mime-version:content-type; bh=+JqkmVt+sHDFIGX5jKp3oP18LQf10VQjAmZAKl1lspY=; b=F87jySDZnMayyitVxLdHcQNL073DytKRyrRh84GNsI24IRNakn0oOfrC2luliNvdea LGTk3adIrzt+N96GyMseWz8T9xE6O/sAI16db48q4Iqkd7uOiDvFsvS3CUQlNhybNw8m CH/o8eELTN0zbSbn5Trp0dkRYXhMX8FTAwrH0=
Domainkey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:to:subject:mime-version:content-type; b=wkbBj0M8NCUlboI6idKooejg0sL2ms7fDPe1tHUkR9Ht0qr5lAJX4q9PMVJeyjWalH 36n4qGLtC2euBJY070bVra8IBB9FeDEW9C35BC1vuPT5XyucCm0hulbE86+uiUTXCkaB 6ykquzQGCer7xPAcMJqVfXDkHo3H61HM9oCQM=
Message-Id: <c8f49cec0807011530k11196ad4p7cb4b9420f2ae752@mail.gmail.com>
Mime-Version: 1.0
Content-Type: multipart/alternative; boundary="----=_Part_3927_12044027.1214951458678"
X-Spam-Status: score=3.7 tests=DNS_FROM_RFC_POST, HTML_00_10, HTML_MESSAGE, HTML_SHORT_LENGTH version=3.1.7
X-Spam-Level: ***
Message Body: **The email message body**
"""

header 存储为字符串,我如何解析这个 header,以便我可以将它映射到字典,因为 header 字段是键, values 是字典中的值吗?

我想要这样的字典,

header_dict = {
'From': 'Media Temple user (mt.kb.user@gmail.com)',
'Subject': article: 'A sample header',
'Date': 'January 25, 2011 3:30:58 PM PDT'
'and so on': .. . . . .. . . .. . 
 . . . . .. . . . ..  . . . . .
} 

我列出了必填字段,

header_reqd = ['From:','Subject:','Date:','To:','Return-Path:','Envelope-To:','Delivery-Date:','Received:','Dkim-Signature:','Domainkey-Signature:','Message-Id:','Mime-Version:','Content-Type:','X-Spam-Status:','X-Spam-Level:','Message Body:']

这可以列出项目可能是字典的键。

您可以在换行符处拆分字符串,然后在“:”处拆分每行

>>> my_header = {}
>>> for x in header.strip().split("\n"):
...     x = x.split(":", 1)
...     my_header[x[0]] = x[1]
... 

split 适合你:

演示:

>>> result = {}
>>> for i in header.split("\n"):
...    i = i.strip()
...    if i :
...       k, v = i.split(":", 1)
...       result[k] = v

输出:

>>> import pprint
>>> pprint.pprint(result)
{'Content-Type': ' multipart/alternative; boundary="----=_Part_3927_12044027.1214951458678"',
 'Date': ' January 25, 2011 3:30:58 PM PDT',
 'Delivery-Date': ' Tue, 25 Jan 2011 15:31:01 -0700',
 'Dkim-Signature': ' v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from:to :subject:mime-version:content-type; bh=+JqkmVt+sHDFIGX5jKp3oP18LQf10VQjAmZAKl1lspY=; b=F87jySDZnMayyitVxLdHcQNL073DytKRyrRh84GNsI24IRNakn0oOfrC2luliNvdea LGTk3adIrzt+N96GyMseWz8T9xE6O/sAI16db48q4Iqkd7uOiDvFsvS3CUQlNhybNw8m CH/o8eELTN0zbSbn5Trp0dkRYXhMX8FTAwrH0=',
 'Domainkey-Signature': ' a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:to:subject:mime-version:content-type; b=wkbBj0M8NCUlboI6idKooejg0sL2ms7fDPe1tHUkR9Ht0qr5lAJX4q9PMVJeyjWalH 36n4qGLtC2euBJY070bVra8IBB9FeDEW9C35BC1vuPT5XyucCm0hulbE86+uiUTXCkaB 6ykquzQGCer7xPAcMJqVfXDkHo3H61HM9oCQM=',
 'Envelope-To': ' user@example.com',
 'From': ' Media Temple user (mt.kb.user@gmail.com)',
 'Message Body': ' **The email message body**',
 'Message-Id': ' <c8f49cec0807011530k11196ad4p7cb4b9420f2ae752@mail.gmail.com>',
 'Mime-Version': ' 1.0',
 'Received': ' from :po-out-1718.google.com ([72.14.252.155]:54907) by cl35.gs01.gridserver.com with esmtp (Exim 4.63) (envelope-from <mt.kb.user@gmail.com>) id 1KDoNH-0000f0-RL for user@example.com; Tue, 25 Jan 2011 15:31:01 -0700',
 'Return-Path': ' <mt.kb.user@gmail.com>',
 'Subject': ' article: A sample header',
 'To': ' user@example.com',
 'X-Spam-Level': ' ***',
 'X-Spam-Status': ' score=3.7 tests=DNS_FROM_RFC_POST, HTML_00_10, HTML_MESSAGE, HTML_SHORT_LENGTH version=3.1.7'}
header = """From: Media Temple user (mt.kb.user@gmail.com)
Subject: article: A sample header
Date: January 25, 2011 3:30:58 PM PDT
To: user@example.com
Return-Path: <mt.kb.user@gmail.com>
Envelope-To: user@example.com
Delivery-Date: Tue, 25 Jan 2011 15:31:01 -0700
Received: from :po-out-1718.google.com ([72.14.252.155]:54907) by cl35.gs01.gridserver.com with esmtp (Exim 4.63) (envelope-from <mt.kb.user@gmail.com>) id 1KDoNH-0000f0-RL for user@example.com; Tue, 25 Jan 2011 15:31:01 -0700
Dkim-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from:to :subject:mime-version:content-type; bh=+JqkmVt+sHDFIGX5jKp3oP18LQf10VQjAmZAKl1lspY=; b=F87jySDZnMayyitVxLdHcQNL073DytKRyrRh84GNsI24IRNakn0oOfrC2luliNvdea LGTk3adIrzt+N96GyMseWz8T9xE6O/sAI16db48q4Iqkd7uOiDvFsvS3CUQlNhybNw8m CH/o8eELTN0zbSbn5Trp0dkRYXhMX8FTAwrH0=
Domainkey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:to:subject:mime-version:content-type; b=wkbBj0M8NCUlboI6idKooejg0sL2ms7fDPe1tHUkR9Ht0qr5lAJX4q9PMVJeyjWalH 36n4qGLtC2euBJY070bVra8IBB9FeDEW9C35BC1vuPT5XyucCm0hulbE86+uiUTXCkaB 6ykquzQGCer7xPAcMJqVfXDkHo3H61HM9oCQM=
Message-Id: <c8f49cec0807011530k11196ad4p7cb4b9420f2ae752@mail.gmail.com>
Mime-Version: 1.0
Content-Type: multipart/alternative; boundary="----=_Part_3927_12044027.1214951458678"
X-Spam-Status: score=3.7 tests=DNS_FROM_RFC_POST, HTML_00_10, HTML_MESSAGE, HTML_SHORT_LENGTH version=3.1.7
X-Spam-Level: ***
Message Body: **The email message body**
"""   

拆分成单独的行,然后在 :

上拆分每行一次
from pprint import pprint as pp
pp(dict(line.split(":",1) for line in header.splitlines()))

输出:

{'Content-Type': ' multipart/alternative; '
                 'boundary="----=_Part_3927_12044027.1214951458678"',
 'Date': ' January 25, 2011 3:30:58 PM PDT',
 'Delivery-Date': ' Tue, 25 Jan 2011 15:31:01 -0700',
 'Dkim-Signature': ' v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; '
                   's=gamma; '
                   'h=domainkey-signature:received:received:message-id:date:from:to '
                   ':subject:mime-version:content-type; '
                   'bh=+JqkmVt+sHDFIGX5jKp3oP18LQf10VQjAmZAKl1lspY=; '
                   'b=F87jySDZnMayyitVxLdHcQNL073DytKRyrRh84GNsI24IRNakn0oOfrC2luliNvdea '
                   'LGTk3adIrzt+N96GyMseWz8T9xE6O/sAI16db48q4Iqkd7uOiDvFsvS3CUQlNhybNw8m '
                   'CH/o8eELTN0zbSbn5Trp0dkRYXhMX8FTAwrH0=',
 'Domainkey-Signature': ' a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; '
                        'h=message-id:date:from:to:subject:mime-version:content-type; '
                        'b=wkbBj0M8NCUlboI6idKooejg0sL2ms7fDPe1tHUkR9Ht0qr5lAJX4q9PMVJeyjWalH '
                        '36n4qGLtC2euBJY070bVra8IBB9FeDEW9C35BC1vuPT5XyucCm0hulbE86+uiUTXCkaB '
                        '6ykquzQGCer7xPAcMJqVfXDkHo3H61HM9oCQM=',
 'Envelope-To': ' user@example.com',
 'From': ' Media Temple user (mt.kb.user@gmail.com)',
 'Message Body': ' **The email message body**',
 'Message-Id': ' '
               '<c8f49cec0807011530k11196ad4p7cb4b9420f2ae752@mail.gmail.com>',
 'Mime-Version': ' 1.0',
 'Received': ' from :po-out-1718.google.com ([72.14.252.155]:54907) by '
             'cl35.gs01.gridserver.com with esmtp (Exim 4.63) (envelope-from '
             '<mt.kb.user@gmail.com>) id 1KDoNH-0000f0-RL for '
             'user@example.com; Tue, 25 Jan 2011 15:31:01 -0700',
 'Return-Path': ' <mt.kb.user@gmail.com>',
 'Subject': ' article: A sample header',
 'To': ' user@example.com',
 'X-Spam-Level': ' ***',
 'X-Spam-Status': ' score=3.7 tests=DNS_FROM_RFC_POST, HTML_00_10, '
                  'HTML_MESSAGE, HTML_SHORT_LENGTH version=3.1.7'}

line.split(":",1) 确保我们只在 : 上拆分一次,因此如果值中有任何 : 我们也不会最终拆分它。您最终得到的子列表是 key/value 对,因此调用 dict 从每个配对创建 dict

似乎大多数这些答案都忽略了 Python email parser 并且输出结果不正确,值中有前缀空格。此外,OP 可能通过在 header 字符串中包含一个前面的换行符来打错字,这需要剥离电子邮件解析器才能工作。

from email.parser import HeaderParser
header = header.strip() # Fix incorrect formatting
email_message = HeaderParser().parsestr(header)
dict(email_message)

输出(截断):

>>> from pprint import pprint
>>> pprint(dict(email_message))
{'Content-Type': 'multipart/alternative; '
                 'boundary="----=_Part_3927_12044027.1214951458678"',
 'Date': 'January 25, 2011 3:30:58 PM PDT',
 'Delivery-Date': 'Tue, 25 Jan 2011 15:31:01 -0700',
 ...
 'Subject': 'article: A sample header',
 'To': 'user@example.com',
 'X-Spam-Level': '***',
 'X-Spam-Status': 'score=3.7 tests=DNS_FROM_RFC_POST, HTML_00_10, '
                  'HTML_MESSAGE, HTML_SHORT_LENGTH version=3.1.7'}

重复 header 个键

请注意,电子邮件 headers 可能包含重复密钥,如 email.message

的 Python 文档中所述

Headers are stored and returned in case-preserving form, but field names are matched case-insensitively. Unlike a real dict, there is an ordering to the keys, and there can be duplicate keys. Additional methods are provided for working with headers that have duplicate keys.

例如,将以下电子邮件消息转换为 Python 字典,仅保留第一个 Received 键。

headers = HeaderParser().parsestr("""Received: by mx0047p1mdw1.sendgrid.net with SMTP id 6WCVv7KAWn Wed, 27 Jul 2016 20:53:06 +0000 (UTC)
Received: from mail-io0-f169.google.com (mail-io0-f169.google.com [209.85.223.169]) by mx0047p1mdw1.sendgrid.net (Postfix) with ESMTPS id AA9FFA817F2 for <example@example.comom>; Wed, 27 Jul 2016 20:53:06 +0000 (UTC)
Received: by mail-io0-f169.google.com with SMTP id b62so81593819iod.3 for <example@example.comom>; Wed, 27 Jul 2016 13:53:06 -0700 (PDT)""")

dict(headers)
{'Received': 'by mx0047p1mdw1.sendgrid.net with SMTP id 6WCVv7KAWn Wed, 27 Jul 2016 20:53:06 +0000 (UTC)'}

使用get_all方法检查重复项:

headers.get_all('Received')
['by mx0047p1mdw1.sendgrid.net with SMTP id 6WCVv7KAWn Wed, 27 Jul 2016 20:53:06 +0000 (UTC)', 'from mail-io0-f169.google.com (mail-io0-f169.google.com [209.85.223.169]) by mx0047p1mdw1.sendgrid.net (Postfix) with ESMTPS id AA9FFA817F2 for <example@example.comom>; Wed, 27 Jul 2016 20:53:06 +0000 (UTC)', 'by mail-io0-f169.google.com with SMTP id b62so81593819iod.3 for <example@example.comom>; Wed, 27 Jul 2016 13:53:06 -0700 (PDT)']