Merge emails in text format into one csv file for machine learning
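I'm working on a machine learning problem using the Enron dataset. I want to merge all the spam files into one csv file and all the ham files into another csv file for further analysis.
I'm using the dataset listed here: https://github.com/crossedbanana/Enron-Email-Classification
I merged the emails with the code below, and the merge itself works. However, when I try to read the csv file and load it into pandas, I get this error: ParserError: Error tokenizing data. C error: Expected 1 fields in line 8, saw 2
Code to merge the email files in txt format into a csv:

    import glob
    import os

    # append the contents of every spam .txt file to one output file
    for f in glob.glob("./dataset_temp/spam/*.txt"):
        os.system("cat " + f + " >> OutFile1.csv")
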
Code to load into pandas:
```
import pandas as pd

# reading the csv into pandas
emails = pd.read_csv('OutFile1.csv')
print(emails.shape)
```
1. How can I get rid of the parser error? I think it occurs because of the commas in the email messages.
2. How can I load each email message into pandas with just the email body?
This is what the email format looks like (an example of a text file in the spam folder).
The commas in line 3 cause the problem when loading into pandas:
*Subject: your prescription is ready . . oxwq s f e
low cost prescription medications
soma , ultram , adipex , vicodin many more
prescribed online and shipped
overnight to your door ! !
one of our us licensed physicians will write an
fda approved prescription for you and ship your
order overnight via a us licensed pharmacy direct
to your doorstep . . . . fast and secure ! !
click here !
no thanks , please take me off your list
ogrg z
lqlokeolnq
lnu*
Thanks for any help.
You can use Excel files instead of reading and writing the data as CSV. That way you won't get any errors because of ',' (commas). Just replace csv with excel.
Here is an example:

    import os
    import pandas as pd
    import codecs

    # Function to create list of emails.
    def create_email_list(folder_path):
        email_list = []
        # provide the folder path; if the folder is in the same directory, the folder name is enough
        folder = os.listdir(folder_path)
        for txt in folder:
            file_name = fr'{folder_path}/{txt}'
            # read one email and add it to the list
            with codecs.open(file_name, 'r', encoding='utf-8', errors='ignore') as f:
                email = f.read()
                email_list.append(email)
        return email_list

    # to_excel needs an Excel writer package such as openpyxl to be installed
    spam_list = create_email_list('spam')  # read all spam emails
    spam_df = pd.DataFrame(spam_list)      # create a dataframe of spam
    spam_df.to_excel('spam.xlsx')          # create the Excel file of spam
    ham_list = create_email_list('ham')    # read all ham emails
    ham_df = pd.DataFrame(ham_list)        # create a dataframe of ham
    ham_df.to_excel('ham.xlsx')            # create the Excel file of ham

You only need to pass the folder path to the function (just the folder name if the folder is in the same directory). This code will create the Excel files.
To avoid problems with `,` you can use a different delimiter (e.g. `|`) or put quotes around the field:
"soma , ultram , adipex , vicodin many more"
If the field itself contains a quote, you have to escape it with another quote:
"soma , ultram , ""adipex"" , vicodin many more"
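As a small illustration of the quoting rules above, Python's standard `csv` module applies them automatically when writing. This is only a sketch that writes to an in-memory buffer; the two rows are the sample lines from this answer.

```python
import csv
import io

buffer = io.StringIO()
writer = csv.writer(buffer)  # default dialect: comma delimiter, quote only when needed

# A field containing commas is wrapped in double quotes
writer.writerow(["soma , ultram , adipex , vicodin many more"])
# A field containing double quotes has each quote doubled
writer.writerow(['soma , ultram , "adipex" , vicodin many more'])

csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers the original single-field rows
rows = list(csv.reader(io.StringIO(csv_text)))
print(rows)
```

Swapping the delimiter instead is also possible with `csv.writer(buffer, delimiter="|")`, but it only helps if `|` never appears in the emails.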
However, your approach produces one csv record for every line of every email. It is probably more logical to have one record per email:
subject,body
your prescription is ready . . oxwq s f e,"low cost prescription medications
soma , ultram , adipex , vicodin many more
prescribed online and shipped
overnight to your door ! !
one of our us licensed physicians will write an
fda approved prescription for you and ship your
order overnight via a us licensed pharmacy direct
to your doorstep . . . . fast and secure ! !
click here !
no thanks , please take me off your list
ogrg z
lqlokeolnq
lnu"
test subject2,"test
body 2"
The example above gives you a table with 2 columns, `subject` and `body`, where `body` is a multi-line field enclosed in double quotes.
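This is how I solved my problem. First, read all the txt files:
```
import codecs
import glob

import pandas as pd

BASE_DIR = './'
SPAM_DIR = 'spam/'
HAM_DIR = 'ham/'  # assumed folder name; it was not defined in the original snippet

def load_text_file(filenames):
    text_list = []
    for filename in filenames:
        with codecs.open(filename, "r", "utf-8", errors='ignore') as f:
            # read the whole email and flatten Windows line endings
            text = f.read().replace('\r\n', ' ')
            text_list.append(text)
    return text_list

# collect the filenames, then read every file into a list
ham_filenames = glob.glob(BASE_DIR + HAM_DIR + '*.txt')
ham_list = load_text_file(ham_filenames)

# load the list into a dataframe
df = pd.DataFrame(ham_list, columns=['emails'])
```
Once I had it in a dataframe, I parsed the emails into subject and body. Thanks, everyone, for the help.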
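The parsing step is not shown in the post. One way it could look is sketched below, under the assumption that each email still starts with a `Subject:` line followed by a `\n` line break (so the line breaks were not all flattened when reading). The column names `subject` and `body` and the sample data are illustrative, not taken from the original.

```python
import pandas as pd

# Stand-in for the dataframe built above, with two shortened sample emails
df = pd.DataFrame({"emails": [
    "Subject: your prescription is ready\nlow cost prescription medications\nsoma , ultram",
    "Subject: test subject2\ntest\nbody 2",
]})

# Split each email once, at the first line break: header line vs. everything else
parts = df["emails"].str.split("\n", n=1, expand=True)

# Drop the "Subject:" prefix from the first line; the remainder is the body
df["subject"] = parts[0].str.replace(r"^Subject:\s*", "", regex=True)
df["body"] = parts[1].fillna("").str.strip()

print(df[["subject", "body"]])
```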