Parsing e-mails
I am trying to split mail files like this one:
Message-ID: <53197.1075859003723.JavaMail.evans@thyme>
Date: Tue, 23 Oct 2001 10:31:09 -0700 (PDT)
From: scott.dozier@enron.com
To: tom.donohoe@enron.com, bonnie.chang@enron.com, m..love@enron.com
Subject: RE: CMS Deal #1027152
Cc: lisa.valderrama@enron.com, thomas.mcfatridge@enron.com
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Bcc: lisa.valderrama@enron.com, thomas.mcfatridge@enron.com
X-From: Dozier, Scott </O=ENRON/OU=NA/CN=RECIPIENTS/CN=SDOZIER>
X-To: Donohoe, Tom </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Tdonoho>, Chang, Bonnie </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Bchang>, Love, Phillip M. </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Plove>
X-cc: Valderrama, Lisa </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Lvalde2>, McFatridge, Thomas </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Tmcfatri>
X-bcc:
X-Folder: \TDONOHO (Non-Privileged)\Inbox
X-Origin: Donohoe-T
X-FileName: TDONOHO (Non-Privileged).pst
I am not sure if they have confirmed either deal. However, deal #1034254 was never pathed by us, whereas 1027152 was. Therefore, nothing billed out under 1034254.
Bonnie - I am including you on this note in case you can add anything about the pathing of the two deals mentioned in this note. Niether CMS orgaination shows anything on Trunkline that matches this. We spoke briefly about this last week.
Phillip - I am including you in case you can add any clarity or determine who we did this deal(s) with.
Thank you,
Scott
5-7213
-----Original Message-----
From: Donohoe, Tom
Sent: Tuesday, October 23, 2001 12:02 PM
To: Dozier, Scott
Subject: RE: CMS Deal #1027152
if they are not confirming this deal are they confirming 1034254?
-----Original Message-----
From: Dozier, Scott
Sent: Tuesday, October 23, 2001 9:24 AM
To: Donohoe, Tom
Cc: Valderrama, Lisa; McFatridge, Thomas
Subject: RE: CMS Deal #1027152
Importance: High
Tom,
In contacting our scheduler and subsequently a CMS scheduler, neither CMS Field Services nor CMS Marketing, Services, and Trading are able to identify the deal. Currently, I am preparing to fax a copy of our confirmation on the deal to CMS Field Services, Again, it is not an executed copy, but I am assuming they may not have sent it back. Furthermore, the CMS Field Services scheduler has told me that they don't even schedule any Trunkline deals.
Considering all of this, I am assuming the worst - that unless we can provide a trader name etc. they will short pay on this deal. So, do you know who represented us with CMS on this deal any one that might know who their trader is or how this deal was booked? We are getting ready to settle for Sep prod so any help asap would be appreciated.
Scott
5-7213
-----Original Message-----
From: Dozier, Scott
Sent: Thursday, October 18, 2001 12:21 PM
To: Donohoe, Tom
Subject: RE: CMS Deal #1027152
They do not recognize that deal at all.
The most recent name and number is a Conoco trader. I have a confirmation on this deal with CMS. However, it is not an executed copy (i.e. sent back or confirmed by CMS). Is there some one who represented us with CMS on this that might know who their trader is or how this deal was booked? I will attempt to contact the scheduler in the mean time but any help would be good.
thanks.
into several files, like this:
Header
Body
Original message 1
Original message 2
...
I have read several posts about splitting mails, and Mime4j seemed like a good choice. So I wrote this:
public class test {
    public static void main(String[] args) throws IOException, MimeException {
        MimeTokenStream stream = new MimeTokenStream();
        stream.parse(new FileInputStream("test"));
        File header = new File("header");
        File body = new File("body");
        BufferedWriter headerWriter = new BufferedWriter(new FileWriter(header));
        BufferedWriter bodyWriter = new BufferedWriter(new FileWriter(body));
        String str;
        for (EntityState state = stream.getState();
             state != EntityState.T_END_OF_STREAM;
             state = stream.next()) {
            switch (state) {
                case T_BODY:
                    str = stream.getInputStream().toString();
                    bodyWriter.write(str);
                    break;
                case T_FIELD:
                    str = stream.getField().toString() + "\n";
                    headerWriter.write(str);
                    break;
            }
        }
        headerWriter.close();
        bodyWriter.close();
    }
}
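This code correctly splits the mail into two files: header and body. There is probably a better way to do it, but I did not find the Mime4j Javadoc very helpful, and I am still trying to fully understand how it works.
However, I ran into two problems:
1) The body starts with a line apparently generated by Mime4j, which looks like this:
[LineReaderInputStreamAdaptor: [pos: 937][limit: 4096][
and I do not know how to get rid of it.
2) The "original messages" are all inside the body. I do not know how to split the body into further parts based on them. Moreover, not all mails have this format. Sometimes the original message is "revealed" only by tabs or a > character in front of each line, or only by a small "from, to" header, or by a different line such as ------Forwarded-------- and so on, so I cannot split on one fixed format.
I thought Mime4j would recognize these parts as a "Multipart" message, but it does not seem to (there is a T_START_MULTIPART case, but it never matches anything).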
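In your example, everything after the last header is part of the body. The e-mail client controls what goes in there (e.g. Outlook may add the original-message separator, while other clients may indent the quoted text instead, and this varies with the client and/or language settings), so your solution can easily break. Check the content type. If it is plain text, you need to scan the body to work out how it is segmented (the original-message separator is not a MIME boundary), and then apply different rules to split the message (see the Outlook resources below). You also need to support multipart, since embedded e-mails may be sent that way. Finally, be aware that you may have no content headers at all.
Here are some resources for you to use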
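The strange-looking text comes from stream.getInputStream().toString(), which you write to the body file.
The toString() method is mainly meant for debugging. Calling it on an InputStream does not return the contents of the stream (which could be large) but only a description of the stream object, and that is what you are seeing.
To get the stream's data, you need to read from the input stream and copy it to an output stream. For various ways to do this, see this answer.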
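As a minimal sketch of that copy step, using only the JDK: the class name StreamCopy and its copy helper are hypothetical and not part of Mime4j or of the question. In the question's loop you would pass stream.getInputStream() as the source instead of calling toString() on it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class StreamCopy {
    // Hypothetical helper: copies every byte from in to out
    // and returns the number of bytes copied.
    public static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[4096];
        long total = 0;
        int n;
        // read() returns -1 once the end of the stream is reached.
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the body stream of a parsed mail.
        InputStream in = new ByteArrayInputStream(
                "They do not recognize that deal at all.\r\n".getBytes(StandardCharsets.US_ASCII));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long copied = copy(in, out);
        System.out.println(copied + " bytes copied");
        System.out.print(new String(out.toByteArray(), StandardCharsets.US_ASCII));
    }
}
```

Writing to a FileOutputStream instead of the ByteArrayOutputStream would produce the body file without the LineReaderInputStreamAdaptor line.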
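As for the original messages: your example is a single e-mail. It has only one MIME part, a plain-text part. People simply copied the original message and put their reply on top, above the message they were replying to.
If they had forwarded the mail as an attachment, the MIME structure would look different: you would see a Content-Type: multipart/mixed; boundary="..." header, and that boundary text would then separate the individual messages. Apache James would probably detect them and handle them properly.
MIME multipart is meant for attachments or for alternative parts of an e-mail (plain text vs. HTML), not for people who top-post their replies.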
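To make the distinction concrete, here is a rough sketch that inspects a Content-Type header value and reports whether it announces a multipart message. The class ContentTypeCheck and its methods are hypothetical names for illustration, and the regular expression is a simplification, not a complete RFC 2045 parameter parser.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContentTypeCheck {
    // Matches boundary=value or boundary="value" (simplified on purpose).
    private static final Pattern BOUNDARY = Pattern.compile(
            "boundary\\s*=\\s*(?:\"([^\"]*)\"|([^;\\s]+))", Pattern.CASE_INSENSITIVE);

    // True when the header value names a multipart/* media type.
    public static boolean isMultipart(String contentType) {
        return contentType != null
                && contentType.trim().toLowerCase(Locale.ROOT).startsWith("multipart/");
    }

    // Returns the boundary parameter of a multipart header value, if present.
    public static Optional<String> boundary(String contentType) {
        if (!isMultipart(contentType)) {
            return Optional.empty();
        }
        Matcher m = BOUNDARY.matcher(contentType);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(m.group(1) != null ? m.group(1) : m.group(2));
    }

    public static void main(String[] args) {
        // The sample mail from the question: a single plain-text part.
        System.out.println(isMultipart("text/plain; charset=us-ascii"));   // prints false
        // A mail forwarded as an attachment would carry a boundary.
        System.out.println(boundary("multipart/mixed; boundary=\"xyz\"")); // prints Optional[xyz]
    }
}
```

For the sample mail this check fails, which is why T_START_MULTIPART is never reached.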
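Since your sample e-mail has no such MIME structure, your best bet is to parse the e-mail body manually, looking for -----Original Message-----. Note that this is fragile: you do not know which mail clients people use, and people may edit the quoted text by hand (possibly by accident).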
import org.apache.james.mime4j.stream.*;
import java.io.*;

public class Library {
    // Separator line that Outlook puts above each quoted message.
    private static final String SEP = "-----Original Message-----";
    private static final String CRLF = "\r\n";
    static int fileNo = 0;

    public static void main(String[] args) throws Exception {
        MimeTokenStream stream = new MimeTokenStream();
        try (InputStream mail = new FileInputStream(args[0]);
             BufferedWriter headerWriter = new BufferedWriter(new FileWriter("header"))) {
            stream.parse(mail);
            for (EntityState state = stream.getState();
                 state != EntityState.T_END_OF_STREAM;
                 state = stream.next()) {
                switch (state) {
                    case T_BODY:
                        // Read the body through a Reader instead of calling toString() on the stream.
                        writePart(new BufferedReader(new InputStreamReader(stream.getInputStream())));
                        break;
                    case T_FIELD:
                        headerWriter.write(stream.getField().toString());
                        headerWriter.write(CRLF);
                        break;
                    default:
                        break;
                }
            }
        }
    }

    // Writes the body to 0.eml, 1.eml, ... and starts a new file at every separator line.
    private static void writePart(BufferedReader in) throws IOException {
        BufferedWriter out = new BufferedWriter(new FileWriter(fileNo + ".eml"));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                // trim() tolerates leading or trailing whitespace around the separator.
                if (SEP.equals(line.trim())) {
                    out.close();
                    fileNo++;
                    out = new BufferedWriter(new FileWriter(fileNo + ".eml"));
                }
                out.write(line);
                out.write(CRLF);
            }
        } finally {
            out.close();
        }
    }
}
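Because the question mentions other separator styles as well, the following is a sketch of a slightly more tolerant splitter that works on the body text alone and needs only the JDK. The class ReplySplitter and the pattern are assumptions for illustration: the pattern recognizes dashed "Original Message" and "Forwarded" lines only, and it does not handle replies quoted with > or tabs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class ReplySplitter {
    // Dashed lines such as "-----Original Message-----" or "------Forwarded by ...------".
    // This is a heuristic, not an exhaustive list of what mail clients emit.
    private static final Pattern SEPARATOR = Pattern.compile(
            "^\\s*-{2,}\\s*(?:original message|forwarded)\\b.*$",
            Pattern.CASE_INSENSITIVE);

    // Splits a plain-text body into parts; each separator line starts a new part.
    public static List<String> split(String body) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : body.split("\\r?\\n")) {
            if (SEPARATOR.matcher(line).matches() && current.length() > 0) {
                parts.add(current.toString());
                current.setLength(0);
            }
            current.append(line).append("\r\n");
        }
        if (current.length() > 0) {
            parts.add(current.toString());
        }
        return parts;
    }

    public static void main(String[] args) {
        String body = "Thank you,\r\nScott\r\n"
                + " -----Original Message-----\r\nFrom: Donohoe, Tom\r\nif they are not confirming...\r\n"
                + "-----Original Message-----\r\nFrom: Dozier, Scott\r\nTom,\r\n";
        List<String> parts = split(body);
        for (int i = 0; i < parts.size(); i++) {
            System.out.println("--- part " + i + " ---");
            System.out.print(parts.get(i));
        }
    }
}
```

Keeping the patterns in one place makes it easier to add further rules as new mail formats turn up, but any such list remains a best-effort guess.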