Extracting HTML table data from email to a CSV file, 1st column values to row headers, using Python
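I am trying to read through an Outlook folder and get ReceivedTime, CC, Subject, and HTMLBody, but with the table in the body extracted into columns. I can 1) pull ReceivedTime, CC, Subject, and HTMLBody into a dataframe, and I can 2) extract the HTMLBody tables into a dataframe, but I am stuck on doing 1) and 2) together.
Current code: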
    import win32com.client
    import pandas as pd
    from bs4 import BeautifulSoup

    outlook = win32com.client.Dispatch("Outlook.Application")
    mapi = outlook.GetNamespace("MAPI")
    inbox = mapi.Folders['User@email.com'].Folders['Inbox'].Folders['Subfolder Name']
    Mail_Messages = inbox.Items

    for mail in Mail_Messages:
        receivedtime = mail.ReceivedTime.strftime('%Y-%m-%d %H:%M:%S')
        cc = mail.CC
        body = mail.HTMLBody
        html_body = BeautifulSoup(body, "lxml")
        html_tables = html_body.find_all('table')[0]
        df = pd.read_html(str(html_tables), header=None)[0]
        display(df)
The current dataframe is shown below. But I also want the associated ReceivedTime, CC, and Subject.
| | 0 | 1 |
|---|---|---|
| 0 | Report Name | Report.pdf |
| 1 | Team Name | Team A |
| 2 | Project Name | Project A |
| 3 | Unique ID Number | 123456789 |
| 4 | Due Date | 1/1/2021 |
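But I want column [0] to become the header row instead. So, as each email is read, it should build a dataframe like the one below covering all emails in the Inbox subfolder: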
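| 0 | Report Name | Team Name | Project Name | Unique ID Number | Due Date | ReceivedTime | CC | Subject |
|---|---|---|---|---|---|---|---|---|
| 1 | Report.pdf | Team A | Project A | 123456789 | 1/5/2021 | 1/1/2021 4:38:44 AM | User1@email.com, User2@email.com | Action Required:Report A Coming due |
| 2 | | | | | | | | |
| 3 | | | | | | | | |
| 4 | | | | | | | | |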
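I am stuck, though. I am still a beginner with Python, and none of the other posts I have found quite cover what I am trying to do. Thanks for any help with this.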
Try this:
    import win32com.client
    import pandas as pd
    from io import StringIO
    from bs4 import BeautifulSoup

    outlook = win32com.client.Dispatch("Outlook.Application")
    mapi = outlook.GetNamespace("MAPI")
    inbox = mapi.Folders['User@email.com'].Folders['Inbox'].Folders['Subfolder Name']
    Mail_Messages = inbox.Items

    # one entry per e-mail: the table values followed by ReceivedTime, CC and Subject
    contents = []
    column_names = ['Report Name', 'Team Name', 'Project Name', 'Unique ID Number', 'Due Date', 'ReceivedTime', 'CC', 'Subject']

    for mail in Mail_Messages:
        body = mail.HTMLBody
        html_body = BeautifulSoup(body, "lxml")
        html_tables = html_body.find_all('table')
        if not html_tables:
            continue  # skip e-mails that contain no table

        # first table of the e-mail: column 0 holds the labels, column 1 the values
        table = pd.read_html(StringIO(str(html_tables[0])), header=None)[0]

        # uncomment the following lines to take the column names from the table instead of hardcoding them
        # column_names = table[0].tolist()
        # column_names.append("ReceivedTime")
        # column_names.append("CC")
        # column_names.append("Subject")

        # a single e-mail as one row; the append order must match column_names
        row = table[1].tolist()
        row.append(mail.ReceivedTime.strftime('%Y-%m-%d %H:%M:%S'))
        row.append(mail.CC)
        row.append(mail.Subject)

        # append each full row to the list
        contents.append(row)

    # finally convert the list of rows into a dataframe
    df = pd.DataFrame(contents, columns=column_names)
    print(df)
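The title asks for a CSV file, and the answer stops at the dataframe, so here is a minimal sketch of the last step. It also shows one way to avoid keeping the column order and the row order in sync by hand: build a dictionary per e-mail and let pandas derive the columns from the keys. The helper name `email_to_record` and the in-memory `sample_table` are hypothetical stand-ins (they are not part of the original answer); in the real loop the table would come from `pd.read_html` and the metadata from the Outlook `mail` object.

```python
import pandas as pd


def email_to_record(table, received_time, cc, subject):
    """Flatten a two-column label/value table plus mail metadata into one dict.

    Assumes labels are in column 0 and values in column 1, which is the
    shape shown in the question for pd.read_html(..., header=None).
    """
    # Column 0 becomes the index, so the Series maps each label to its value.
    record = table.set_index(0)[1].to_dict()
    record["ReceivedTime"] = received_time
    record["CC"] = cc
    record["Subject"] = subject
    return record


# Stand-in for the table parsed from one e-mail (sample values from the question).
sample_table = pd.DataFrame([
    ["Report Name", "Report.pdf"],
    ["Team Name", "Team A"],
    ["Project Name", "Project A"],
    ["Unique ID Number", "123456789"],
    ["Due Date", "1/1/2021"],
])

records = [
    email_to_record(
        sample_table,
        received_time="2021-01-01 04:38:44",
        cc="User1@email.com, User2@email.com",
        subject="Action Required:Report A Coming due",
    )
]

# One row per e-mail; the column names come from the dictionary keys.
df = pd.DataFrame(records)

# With no path, to_csv returns the CSV text; index=False drops the row index.
csv_text = df.to_csv(index=False)
print(csv_text)
```

To write a file instead of a string, pass a path, for example `df.to_csv("emails.csv", index=False)` (the file name is only an example). Building rows as dictionaries also tolerates e-mails whose tables list the fields in a different order, since pandas aligns values by key.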