Extracting HTML table data from email to csv file, 1st column values to row headers, using Python

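I am trying to read through an Outlook folder and get ReceivedTime, CC, Subject and HTMLBody, with the table in HTMLBody extracted into columns. I can 1) pull ReceivedTime, CC, Subject and HTMLBody into a dataframe, and I can 2) extract the HTMLBody tables into a dataframe, but I am stuck on doing 1) and 2) together.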

Current code:

import win32com.client
import pandas as pd
from bs4 import BeautifulSoup


outlook = win32com.client.Dispatch("Outlook.Application")
mapi = outlook.GetNamespace("MAPI")

inbox = mapi.Folders['User@email.com'].Folders['Inbox'].Folders['Subfolder Name']
Mail_Messages = inbox.Items

for mail in Mail_Messages:
     receivedtime = mail.ReceivedTime.strftime('%Y-%m-%d %H:%M:%S')
     cc = mail.CC
     body = mail.HTMLBody
     html_body = BeautifulSoup(body,"lxml")
     html_tables = html_body.find_all('table')[0]

df = pd.read_html(str(html_tables),header=None)[0]
display(df)

The current dataframe is shown below. But I also want the associated ReceivedTime, CC and Subject.

   0                 1
0  Report Name       Report.pdf
1  Team Name         Team A
2  Project Name      Project A
3  Unique ID Number  123456789
4  Due Date          1/1/2021

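But I would like the values in column [0] to become the header row instead. So, reading each e-mail, it would produce one dataframe like the following for all e-mails in the Inbox subfolder: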

0 Report Name Team Name Project Name Unique ID Number Due Date ReceivedTime CC Subject
1 Report.pdf Team A Project A 123456789 1/5/2021 1/1/2021 4:38:44 AM User1@email.com, User2@email.com Action Required:Report A Coming due
2
3
4

But I am stuck. I am still a beginner with Python, and none of the other posts I have seen quite got me to what I want to do. Thanks for any help with this.

Try this:

import win32com.client
import pandas as pd
from bs4 import BeautifulSoup
from pprint import pprint

outlook = win32com.client.Dispatch("Outlook.Application")
mapi = outlook.GetNamespace("MAPI")

inbox = mapi.Folders['User@email.com'].Folders['Inbox'].Folders['Subfolder Name']
Mail_Messages = inbox.Items

# a list that will hold one row per e-mail: the table values plus ReceivedTime, CC and Subject
contents = []
column_names = ['Report Name', 'Team Name', 'Project Name', 'Unique ID Number', 'Due Date', 'ReceivedTime', 'CC', 'Subject']

for mail in Mail_Messages:

    body = mail.HTMLBody
    html_body = BeautifulSoup(body, "lxml")
    html_tables = html_body.find_all('table')

    # uncomment the following lines if you want the column names defined programmatically rather than hardcoded
    # column_names = pd.read_html(str(html_tables), header=None)[0][0]
    # column_names = column_names.tolist()
    # column_names.append("ReceivedTime")
    # column_names.append("CC")
    # column_names.append("Subject")

    # a list holding a single e-mail's data: the second column of the HTML table, then ReceivedTime, CC and Subject
    row = pd.read_html(str(html_tables), header=None)[0][1]
    row = row.tolist()
    # append in the same order as column_names, otherwise the values land under the wrong headers
    row.append(mail.ReceivedTime.strftime('%Y-%m-%d %H:%M:%S'))
    row.append(mail.CC)
    row.append(mail.Subject)

    # appending each full row to a list
    contents.append(row)


# and finally converting a list into dataframe
df = pd.DataFrame(contents, columns=column_names)

pprint(df)
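
If you want to try the reshaping step without Outlook, here is a minimal sketch. It assumes every e-mail carries the same kind of two-column table (labels in column 0, values in column 1). The hand-built `table` only stands in for what `pd.read_html(...)[0]` returns for one e-mail, and `mail_to_record` is a hypothetical helper name, not part of pandas or pywin32. Building each row as a dict keyed by label means the values cannot end up under the wrong header, whatever order you add them in.

```python
import pandas as pd

# Stand-in for pd.read_html(...)[0] on one e-mail (sample values from the question):
# labels in column 0, values in column 1.
table = pd.DataFrame([
    ["Report Name", "Report.pdf"],
    ["Team Name", "Team A"],
    ["Project Name", "Project A"],
    ["Unique ID Number", "123456789"],
    ["Due Date", "1/1/2021"],
])


def mail_to_record(table, received_time, cc, subject):
    """Hypothetical helper: turn one two-column table plus mail metadata into a dict."""
    # Pair each label in column 0 with its value in column 1.
    record = dict(zip(table[0], table[1]))
    record["ReceivedTime"] = received_time
    record["CC"] = cc
    record["Subject"] = subject
    return record


# In the real loop there would be one record per mail in Mail_Messages.
records = [
    mail_to_record(
        table,
        "2021-01-01 04:38:44",
        "User1@email.com, User2@email.com",
        "Action Required:Report A Coming due",
    )
]

# A list of dicts becomes one row per e-mail, with the dict keys as column headers.
df = pd.DataFrame(records)
print(df)
```

From there, `df.to_csv('emails.csv', index=False)` writes the result to a csv file (the file name is only a placeholder).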