Removing additional line-breaks from email body
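When sending emails, many servers add additional line breaks to limit the length of each line.
How can I restore the original line breaks when fetching the email in a PHP script?
Example
Suppose I send the following: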
Lorem ipsum Dolore incididunt in culpa ea ea sed quis sint voluptate quis laborum ullamco Excepteur do adipisicing consequat ex in reprehenderit officia in ad deserunt magna nulla dolor laborum occaecat reprehenderit aliquip dolor ea anim ea in veniam adipisicing culpa tempor qui elit voluptate consectetur elit laboris minim consectetur laboris anim incididunt Ut sunt sunt mollit elit irure do cillum dolore consequat in ea culpa ut velit sunt nulla in dolore voluptate dolore laborum reprehenderit dolore ut.
Ut non in veniam enim minim elit ad ut id ad eu voluptate cillum dolor laboris irure tempor mollit dolore exercitation eiusmod ea non ea ullamco nostrud cillum nostrud laborum commodo esse reprehenderit ut deserunt officia do in anim dolore ullamco pariatur ex amet nulla Excepteur mollit officia fugiat eu sed quis nisi fugiat dolor ea commodo ut sunt in consequat consectetur ut nulla pariatur est dolor dolore non ut occaecat officia Duis Ut ex exercitation esse ullamco nulla incididunt commodo pariatur dolore nostrud fugiat id dolor minim non sint amet adipisicing occaecat enim non Ut ad irure sint aliquip nisi ut commodo minim proident elit nulla quis ut ad dolor Excepteur dolore Duis.
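Note that there is only a single line break in this text!
Viewing the message source in Thunderbird on the receiving end, or fetching the email body through PHP, gives the content in the following format: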
Lorem ipsum Dolore incididunt in culpa ea ea sed quis sint voluptate
quis laborum ullamco Excepteur do adipisicing consequat ex in
reprehenderit officia in ad deserunt magna nulla dolor laborum occaecat
reprehenderit aliquip dolor ea anim ea in veniam adipisicing culpa
tempor qui elit voluptate consectetur elit laboris minim consectetur
laboris anim incididunt Ut sunt sunt mollit elit irure do cillum dolore
consequat in ea culpa ut velit sunt nulla in dolore voluptate dolore
laborum reprehenderit dolore ut.
Ut non in veniam enim minim elit ad ut id ad eu voluptate cillum dolor
laboris irure tempor mollit dolore exercitation eiusmod ea non ea
ullamco nostrud cillum nostrud laborum commodo esse reprehenderit ut
deserunt officia do in anim dolore ullamco pariatur ex amet nulla
Excepteur mollit officia fugiat eu sed quis nisi fugiat dolor ea commodo
ut sunt in consequat consectetur ut nulla pariatur est dolor dolore non
ut occaecat officia Duis Ut ex exercitation esse ullamco nulla
incididunt commodo pariatur dolore nostrud fugiat id dolor minim non
sint amet adipisicing occaecat enim non Ut ad irure sint aliquip nisi ut
commodo minim proident elit nulla quis ut ad dolor Excepteur dolore Duis.
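Note that every line is limited to a certain length, so there are 16 additional line breaks. These additional line breaks were added automatically somewhere in the chain of events that led to me receiving the email.
I want my email-fetching PHP script to remove the additional line breaks and restore the original two-line format of the content.
I know that the new line breaks are not added by the PHP script, and I know where they come from, but I don't know how to make my PHP script remove them.
Here is the code used to fetch the email body: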
    $connection = imap_open(
        sprintf(
            '{%s:110/pop3}INBOX',
            Configure::read('Email.Inbox.host')
        ),
        Configure::read('Email.Inbox.email'),
        Configure::read('Email.Inbox.password')
    );
    $mailbox = imap_check($connection);
    $messages = imap_fetch_overview($connection, '1:' . $mailbox->Nmsgs);
    foreach ($messages as $message) {
        $content = imap_fetchbody($connection, $message->msgno, 1);
    }
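What have I tried?
I tried using imap_body instead of imap_fetchbody, because the former does not process the email body. But the additional line breaks are already there at that point, and they look no different from the regular ones: both consist of \r\n.
I believe there must be a way to do this, because Thunderbird displays the received email in the correct format, without the 16 additional line breaks, even though they are present in the message source it shows. So there has to be some way to remove the 16 additional line breaks from the email.
Here is a screenshot from Thunderbird showing the source of the email at the top and the resulting plain text at the bottom.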
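Although this question is quite old, it was one of the top results when I ran into this exact same problem. As Marc pointed out in the comments, it does indeed have to do with format=flowed. So I dug into RFC 2646 and found section 4.1, Generating Format=Flowed: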
Because a soft line break is a SP CRLF sequence, the generating agent creates one by inserting a CRLF after the occurance of a space.
A generating agent SHOULD NOT insert white space into a word (a sequence of printable characters not containing spaces). If faced with a word which exceeds 79 characters (but less than 998 characters, the [SMTP] limit on line length), the agent SHOULD send the word as is and exceed the 79-character limit on line length.
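To make the quoted rule concrete, here is a minimal sketch of how a sending client might produce such soft line breaks. The function name flowEncode and the default width of 72 are assumptions for illustration, not something taken from the RFC text above or from any library. The sketch counts bytes rather than characters and ignores quoting and signature separators.

```php
<?php
// Hypothetical helper (illustration only): wraps each input line into
// format=flowed output by breaking only after a space.
function flowEncode(string $text, int $width = 72): string
{
    $lines = [];
    foreach (preg_split('/\r?\n/', $text) as $paragraph) {
        // Trailing spaces would turn the final hard break into a soft one.
        $words = explode(' ', rtrim($paragraph, ' '));
        $last = count($words) - 1;
        $line = '';
        foreach ($words as $i => $word) {
            // Every word except the last keeps the space that follows it.
            $token = $i < $last ? $word . ' ' : $word;
            if ($line !== '' && strlen($line) + strlen($token) > $width) {
                $lines[] = $line; // ends in a space: soft line break
                $line = '';
            }
            // A word longer than $width is sent as is, on a line of its own.
            $line .= $token;
        }
        $lines[] = $line; // no trailing space: hard line break
    }
    // Space-stuffing: protect lines starting with ' ', '>' or 'From '.
    foreach ($lines as $k => $l) {
        if ($l !== '' && ($l[0] === ' ' || $l[0] === '>' || strncmp($l, 'From ', 5) === 0)) {
            $lines[$k] = ' ' . $l;
        }
    }
    return implode("\r\n", $lines);
}
```

Every soft break produced this way is a space followed by CRLF, while the original paragraph break is a CRLF with no space before it. That difference is what lets a receiver tell the two apart.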
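So, in order to get the email as it was originally written, simply search for all occurrences of SP+CRLF and remove the line break from each of them. After that you probably also want to undo space-stuffing, while taking quoted text into account (lines that start with any number of > characters, followed by a space). According to the RFC, the order of the tests is quote > space-stuffing > flowed line: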
On reception, if the first character of a line is a space, it is logically deleted. This occurs after the test for a quoted line, and before the test for a flowed line.
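The order of tests in that quote can be sketched as a line-by-line decoder. This is only an illustration under stated assumptions: the function name flowDecode and the $delSp flag are hypothetical. The flag models the DelSp=yes parameter from the later RFC 3676, where the space before a soft break is dropped instead of kept. The signature separator "-- " is treated as never flowed.

```php
<?php
// Hypothetical helper (illustration only): decodes a format=flowed body
// in the order quote test, space-unstuffing, flowed test.
function flowDecode(string $body, bool $delSp = false): string
{
    $out = [];       // completed logical lines
    $buffer = '';    // paragraph currently being joined
    $open = false;   // true while the previous line ended in a soft break
    $openDepth = 0;  // quote depth of the paragraph in $buffer

    $flush = function () use (&$out, &$buffer, &$open, &$openDepth) {
        $prefix = str_repeat('>', $openDepth);
        if ($prefix !== '' && $buffer !== '') {
            $prefix .= ' ';
        }
        $out[] = $prefix . $buffer;
        $buffer = '';
        $open = false;
    };

    foreach (preg_split('/\r?\n/', $body) as $line) {
        // 1. Quote test: count and strip the leading '>' characters.
        $depth = strspn($line, '>');
        $line = substr($line, $depth);
        // 2. Space-unstuffing: delete exactly one leading space.
        if ($line !== '' && $line[0] === ' ') {
            $line = substr($line, 1);
        }
        // 3. Flowed test: a trailing space marks a soft line break.
        $isSignature = ($line === '-- ');
        $flowed = !$isSignature && $line !== '' && substr($line, -1) === ' ';
        if ($flowed && $delSp) {
            $line = substr($line, 0, -1);
        }
        // A change of quote depth or a signature separator ends the paragraph.
        if ($open && ($depth !== $openDepth || $isSignature)) {
            $flush();
        }
        $openDepth = $depth;
        $buffer .= $line;
        if ($flowed) {
            $open = true;
        } else {
            $flush();
        }
    }
    if ($open) {
        $flush();
    }
    return implode("\n", $out);
}
```

Working per line keeps the three tests separate, which makes quote-depth changes and the signature separator easier to handle than with a single regex over the whole body.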
A rough PoC from my own kitchen:
    // I'm using imap_fetchmime() because I want to be sure I'm getting the proper MIME headers for the relevant section
    // ($section is the section number also passed to imap_fetchbody(), e.g. 1 in the question's code)
    $mimes = imap_fetchmime($connection, $message->msgno, $section);
    // I don't want to store all headers in an array since I just want to know the Content-Type
    // [ \t]* is probably not necessary but it's there in case of broken clients/servers
    if (preg_match('/^[ \t]*Content-Type.*format=flowed\b/mi', $mimes)) {
        // First, undo space-stuffing (only the first space of a line is stuffing), but don't touch stuffed lines with quotes
        $content = preg_replace('/^ (?!>+ )/m', '', $content);
        // Then, remove the (CR)LF of flowed SP+(CR)LF sequences, plus any quote marks that follow it, to re-form one long line of text
        // The space itself is put back via $1, because under RFC 2646 it is part of the text and separates the words
        $content = preg_replace('/( )\r?\n(>+ +)?/', '$1', $content);
        // Remove empty quoted lines at *the end of the string only*, keeping any such lines anywhere else as-is for readability
        $content = preg_replace('/(\r?\n>+\s*)+$/', '', $content);
    }
    // And finally trim the entire thing (regardless of formatting)
    $content = trim($content);
    // Or when outputting to browsers:
    //$content = nl2br(trim($content));
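For me this works with:
- simple one-line emails
- the lorem ipsum example given by the OP, with its 2 paragraphs
- one-liners followed by 2 line breaks and a 2-line signature
- emails with up to 4 levels of quoting (probably more, but I didn't bother checking that far)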
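As a possible alternative to matching the raw MIME headers with a regex, the parsed Content-Type parameters returned by imap_fetchstructure() can be inspected. That also covers a Content-Type header folded over several lines. The helper name isFormatFlowed below is an assumption for illustration. It only relies on the documented ifparameters and parameters properties, where each parameter has an attribute and a value.

```php
<?php
// Hypothetical helper (illustration only): reports whether a body part,
// as returned by imap_fetchstructure(), declares format=flowed.
function isFormatFlowed(object $part): bool
{
    // ifparameters is falsy when the part carries no Content-Type parameters.
    if (empty($part->ifparameters)) {
        return false;
    }
    foreach ($part->parameters as $param) {
        // Parameter names and the "flowed" value are case-insensitive.
        if (strcasecmp($param->attribute, 'format') === 0
            && strcasecmp($param->value, 'flowed') === 0) {
            return true;
        }
    }
    return false;
}

// Possible usage with a live connection (not executed here):
// $structure = imap_fetchstructure($connection, $message->msgno);
// $part = empty($structure->parts) ? $structure : $structure->parts[0];
// if (isFormatFlowed($part)) { /* undo the soft line breaks */ }
```

For a single-part message the structure object itself describes the body. For a multipart message the text part usually sits in the parts array, so which entry to pick depends on the message layout.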