Python email header strange behavior

The email header decoder in Python (2.7 or 3) seems to behave strangely when a header switches between encoded and unencoded text.

from email.header import decode_header
# Each input mixes one encoded-word with plain text, with or without a separating space.
print(decode_header("=?ISO-8859-1?B?QA==?=example.com"))
print(decode_header("=?ISO-8859-1?B?QA==?= example.com"))
print(decode_header("=?ISO-8859-1?Q?=40example?= .com"))
print(decode_header("=?ISO-8859-1?Q?=40example?=.com"))

This is the result:

[('=?ISO-8859-1?B?QA==?=example.com', None)]
[('@', 'iso-8859-1'), ('example.com', None)]
[('@example', 'iso-8859-1'), ('.com', None)]
[('=?ISO-8859-1?Q?=40example?=.com', None)]

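In all of the sample inputs the encoded text is just the @ sign, and it should have been decoded correctly, but it is not. I think Python's interpretation of RFC 1342 is incorrect. Python expects a space or a newline to mark the end of the encoded text. I do not see that in the RFC: as I read it, the RFC only says that a space is needed between multiple encoded-words, not between an encoded-word and the unencoded part of the text. So whenever you see "?=" you should treat it as the end of the encoded text, which Python does not do. Is this a bug, or am I mistaken?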

Vijay

RFC 2047 defines three locations in which an 'encoded-word' may appear. It requires separating whitespace in almost all cases, even between an 'encoded-word' and unencoded text, and most of the cases where separating whitespace is not required appear to be errors. The text reads as follows (without errata applied, and with manually adjusted formatting):

An 'encoded-word' may appear in a message header or body part header according to the following rules:

  1. An 'encoded-word' may replace a 'text' token (as defined by RFC 822) in any Subject or Comments header field, any extension message header field, or any MIME body part field for which the field body is defined as '*text'. An 'encoded-word' may also appear in any user-defined ("X-") message or body part header field.

    Ordinary ASCII text and 'encoded-word's may appear together in the same header field. However, an 'encoded-word' that appears in a header field defined as '*text' MUST be separated from any adjacent 'encoded-word' or 'text' by 'linear-white-space'.

  2. An 'encoded-word' may appear within a 'comment' delimited by "(" and ")", i.e., wherever a 'ctext' is allowed. More precisely, the RFC 822 ABNF definition for 'comment' is amended as follows:

     comment = "(" *(ctext / quoted-pair / comment / encoded-word) ")"
    

    A "Q"-encoded 'encoded-word' which appears in a 'comment' MUST NOT contain the characters "(", ")" or " 'encoded-word' that appears in a 'comment' MUST be separated from any adjacent 'encoded-word' or 'ctext' by 'linear-white-space'.

    It is important to note that 'comment's are only recognized inside "structured" field bodies. In fields whose bodies are defined as '*text', "(" and ")" are treated as ordinary characters rather than comment delimiters, and rule (1) of this section applies. (See RFC 822, sections 3.1.2 and 3.1.3)

  3. As a replacement for a 'word' entity within a 'phrase', for example, one that precedes an address in a From, To, or Cc header. The ABNF definition for 'phrase' from RFC 822 thus becomes:

     phrase = 1*( encoded-word / word )
    

    In this case the set of characters that may be used in a "Q"-encoded 'encoded-word' is restricted to: <upper and lower case ASCII letters, decimal digits, "!", "*", "+", "-", "/", "=", and "_" (underscore, ASCII 95.)>. An 'encoded-word' that appears within a 'phrase' MUST be separated from any adjacent 'word', 'text' or 'special' by 'linear-white-space'.
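
If input that ignores this separation rule still has to be decoded, one option is to treat every "?=" as the end of an 'encoded-word', whatever follows it. The following is only a minimal sketch of that idea, not what `email.header` does: `LENIENT_WORD`, `_decode_word` and `lenient_decode` are hypothetical names, and the sketch does not drop the whitespace between two adjacent 'encoded-word's as a full decoder would.

```python
import base64
import quopri
import re

# Hypothetical pattern: "=?charset?encoding?text?=" with no requirement
# on the character that follows the closing "?=".
LENIENT_WORD = re.compile(r"=\?([^?\s]+)\?([bBqQ])\?([^?\s]*)\?=")


def _decode_word(match):
    """Decode a single matched encoded-word to text."""
    charset, encoding, text = match.groups()
    if encoding.upper() == "B":
        raw = base64.b64decode(text)
    else:
        # header=True makes "_" decode to a space, as the "Q" encoding requires.
        raw = quopri.decodestring(text.encode("ascii"), header=True)
    return raw.decode(charset)


def lenient_decode(value):
    """Replace every encoded-word in value, even without trailing whitespace."""
    return LENIENT_WORD.sub(_decode_word, value)


print(lenient_decode("=?ISO-8859-1?B?QA==?=example.com"))  # @example.com
print(lenient_decode("=?ISO-8859-1?Q?=40example?=.com"))   # @example.com
```

Being this lenient accepts headers that the quoted rules do not allow, so it is best kept for cleaning up known-bad input.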

This is from page 6 of RFC 1342:

An encoded-word may be distinguished from an ordinary "word", "text", or "ctext", as follows: An encoded-word begins with "=?", ends with "?=", contains exactly four "?" characters including the delimiters, and is followed by a SPACE or newline. If the "word", "text", or "ctext" does not meet the above tests, it should be displayed as it appears in the message header.

So a SPACE or newline is required after the encoded-word.
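
The test quoted above can be written as a regular expression. This is just a sketch of the rule as worded in RFC 1342, not the pattern Python itself uses: `ENCODED_WORD` and `find_encoded_words` are hypothetical names, and treating the end of the string like a newline is my own assumption.

```python
import re

# Hypothetical pattern for the RFC 1342 test: begins with "=?", ends with "?=",
# contains exactly four "?" characters, and is followed by a SPACE or newline.
# Assumption: the end of the string counts as a newline.
ENCODED_WORD = re.compile(r"=\?[^?\s]+\?[^?\s]+\?[^?\s]*\?=(?=[ \n]|\Z)")


def find_encoded_words(value):
    """Return the substrings of value that pass the RFC 1342 test."""
    return ENCODED_WORD.findall(value)


print(find_encoded_words("=?ISO-8859-1?B?QA==?=example.com"))   # []
print(find_encoded_words("=?ISO-8859-1?B?QA==?= example.com"))  # ['=?ISO-8859-1?B?QA==?=']
```

Applied to the four inputs from the question, only the two with a space after "?=" pass, which matches the output shown there.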

Examples of encoded headers from the same RFC:

   From: =?US-ASCII?Q?Keith_Moore?= <moore@cs.utk.edu>
   To: =?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?= <keld@dkuug.dk>
   CC: =?ISO-8859-1?Q?Andr=E9_?= Pirard <PIRARD@vm1.ulg.ac.be>
   Subject: =?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=
    =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=
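
As a quick check, the Subject example can be run through `decode_header` and joined back into one string. This is a minimal sketch: `decode_to_text` is a hypothetical helper, and falling back to ASCII for chunks without a charset is my assumption.

```python
from email.header import decode_header


def decode_to_text(raw):
    """Join the (chunk, charset) pairs from decode_header into one string."""
    parts = []
    for chunk, charset in decode_header(raw):
        if isinstance(chunk, bytes):
            # Assumption: chunks without a declared charset are plain ASCII.
            chunk = chunk.decode(charset or "ascii")
        parts.append(chunk)
    return "".join(parts)


# The two encoded-words are separated by a newline plus a space (header folding).
subject = (
    "=?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=\n"
    " =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?="
)
print(decode_to_text(subject))
```

The whitespace between the two encoded-words is dropped during decoding, so the sentence comes out without a gap where the header was folded.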