Email classifier to classify emails according to the time
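I have to design a program that classifies emails as spam or not spam using Python and the Pandas library. I am new to Python, so I decided to go with something simple.
I have already classified the emails as spam or not spam based on their subject. For my second task, I have to classify them based on time: if an email was received on 'Friday' or 'Saturday', it should be classified as spam, otherwise as not spam. I really don't know how to do this. I tried searching but came up with nothing. This is my last hope, and I would be grateful if anyone could help.
Here is a screenshot of the Excel file: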
import pandas as pd
ExcelFile = pd.read_excel(r'C:\Users\Documents\Email Table.xlsx')
Subject = pd.DataFrame(ExcelFile, columns=['Subject'])
def spam(Subject):
    A = len(ExcelFile[ExcelFile['Subject'].isnull()])
    print("Number of spam emails ", A)
    print(ExcelFile[ExcelFile['Subject'].isnull()])
spam(Subject)
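There are a million ways to do this, but here is how I would do it. For clarity I have included comments and some naming conventions, which you can modify as needed to suit your specific needs.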
# All necessary imports
import pandas as pd
import numpy as np
# Create some sample data (made up, nothing specific)
data = {
    'From': ['test@gmail.com', 'test1@gmail.com', 'test2@gmail.com', 'test3@gmail.com', 'test4@gmail.com'],
    'Subject': ['Free Stuff', 'Buy Stuff', np.nan, 'More Free Stuff', 'More Buy Stuff'],
    'Dates': ['2022-05-18 01:00:00', '2022-05-18 03:00:00', '2022-05-19 08:00:00', '2022-05-20 01:00:00', '2022-05-21 10:00:00'],
}
# Create a DataFrame with the data
df = pd.DataFrame(data)
# Set all nulls/Nones/NaNs to a blank string
df.fillna('', inplace=True)
# Convert the Dates column to datetime, parsing the YYYY-MM-DD HH:MM:SS format
df['Dates'] = pd.to_datetime(df['Dates'], format='%Y-%m-%d %H:%M:%S')
# Create a column that holds the weekday name of each value in the Dates column
df['Day'] = df['Dates'].dt.day_name()
# Use np.select() to flag rows whose Subject is empty or whose Day is Friday or Saturday
# This is where you specify which days are spam days
list_of_spam_days = ['Friday', 'Saturday']
# List of conditions to test as True or False (missing subjects were replaced with '' above)
condition_list = [df['Subject'] == '', df['Day'].isin(list_of_spam_days)]
# Mirrors condition_list: the value to use when the matching condition is True
true_list = ['Spam', 'Spam']
# Make a new column that holds the result of the condition and true lists
# The final 'Not Spam' is the default when none of the conditions is satisfied
df['Spam or Not Spam'] = np.select(condition_list, true_list, 'Not Spam')
df
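If you want to reuse the same rule on the DataFrame loaded from your Excel file, the steps above could be wrapped in a small function. This is only a sketch under some assumptions: `label_spam` is a name I made up, and the column names `Dates` and `Subject` are taken from the sample data, so adjust them to whatever your spreadsheet actually uses. It combines the two conditions into one boolean mask instead of calling `np.select()`, which gives the same result here because both conditions map to 'Spam'.

```python
import pandas as pd


def label_spam(df, date_col="Dates", subject_col="Subject",
               spam_days=("Friday", "Saturday")):
    """Return a copy of df with 'Day' and 'Spam or Not Spam' columns added.

    The function name and the default column names are assumptions made
    for this example; they are not part of pandas.
    """
    out = df.copy()
    # Unparseable dates become NaT and therefore never match a spam day
    dates = pd.to_datetime(out[date_col], errors="coerce")
    out["Day"] = dates.dt.day_name()
    # A subject counts as missing if it is null or only whitespace
    subject = out[subject_col]
    missing_subject = subject.isna() | (subject.astype(str).str.strip() == "")
    # An email is spam if either rule applies
    is_spam = missing_subject | out["Day"].isin(list(spam_days))
    out["Spam or Not Spam"] = is_spam.map({True: "Spam", False: "Not Spam"})
    return out


# Same made-up sample data as above
emails = pd.DataFrame({
    "From": ["test@gmail.com", "test1@gmail.com", "test2@gmail.com",
             "test3@gmail.com", "test4@gmail.com"],
    "Subject": ["Free Stuff", "Buy Stuff", None,
                "More Free Stuff", "More Buy Stuff"],
    "Dates": ["2022-05-18 01:00:00", "2022-05-18 03:00:00",
              "2022-05-19 08:00:00", "2022-05-20 01:00:00",
              "2022-05-21 10:00:00"],
})
labeled = label_spam(emails)
print(labeled[["Subject", "Day", "Spam or Not Spam"]])
```

With your own data you would pass the result of `pd.read_excel(...)` in place of `emails`, and change `date_col` if the column with the received time has a different name.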