Extracting numbers from Outlook email body with Python

Question

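Every hour I receive an email alert telling me how much revenue the company made in the past hour. I want to extract this information into a pandas DataFrame so that I can run some analysis on it.

My problem is that I can't figure out how to extract the data from the email body in a usable format. I think I need to use regular expressions, but I'm not very familiar with them.

This is what I have so far: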

import os
import pandas as pd
import datetime as dt
import win32com.client

outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
inbox = outlook.GetDefaultFolder(6)
messages = inbox.Items

#Empty Lists
email_subject = []
email_date = []
email_content = []

#find emails

for message in messages:
    if message.SenderEmailAddress == 'oracle@xyz.com' and message.Subject.startswith('Demand'):
        email_subject.append(message.Subject)
        email_date.append(message.senton.date()) 
        email_content.append(message.body)

An entry in the email_content list looks like this:

'                                                                                                                   \r\nDemand: $41,225 (-47%)\t                                                                            \r\n                                                                                                                       \r\nOrders: 515 (-53%)\t                                                                                \r\nUnits: 849 (-59%)\t                                                                                 \r\n                                                                                                                       \r\nAOV:  (12%)                                                                                                          \r\nAUR:  (30%)                                                                                                          \r\n                                                                                                                       \r\nOrders with Promo Code: 3%                                                                                              \r\nAverage Discount: 21%                                                                                             '

Can anyone tell me how to split its content so that I get the int values of Demand, Orders and Units in separate columns?

Thanks!

Answer 1

You can use a combination of string.split() and string.strip() to first extract each line individually.

string = email_content[0]  # email_content is a list, so take one email body
lines = string.split('\r\n')
lines_stripped = []
for line in lines:
    line = line.strip()
    if line != '':
        lines_stripped.append(line)

This gives you a list like this:

['Demand: $41,225 (-47%)', 'Orders: 515 (-53%)', 'Units: 849 (-59%)', 'AOV:  (12%)', 'AUR:  (30%)', 'Orders with Promo Code: 3%', 'Average Discount: 21%']

You can also achieve the same result in a more compact (pythonic) way:

lines_stripped = [line.strip() for line in string.split('\r\n') if line.strip() != '']
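As a variation on the comprehension above, here is a minimal sketch using str.splitlines(), which splits on '\r\n' as well as a bare '\n', so it does not depend on the exact line endings Outlook returns. The body string is a shortened, hypothetical stand-in for one entry of email_content.

```python
# Hypothetical, shortened stand-in for one email body from email_content
body = '   \r\nOrders: 515 (-53%)\t   \r\nUnits: 849 (-59%)\t   \r\n   \r\nAverage Discount: 21%   '

# splitlines() handles '\r\n', '\n' and '\r'; strip() removes the padding and tabs,
# and the truthiness check drops lines that are empty after stripping
lines_stripped = [line.strip() for line in body.splitlines() if line.strip()]

print(lines_stripped)
```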

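Once you have this list, you can use a suitably chosen regex to extract the values. I recommend https://regexr.com/ for experimenting with your regular expressions.

After some quick experimenting, r'([\S\s]*):\s*(\S*)\s*\(?(\S*)\)?' should work.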
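To make that pattern easier to read, here is a sketch of the same regex written with re.VERBOSE, so each piece carries a comment. The name pattern and the sample line are only for illustration.

```python
import re

# Same pattern as above, spread over several lines with re.VERBOSE
pattern = re.compile(r'''
    ([\S\s]*)   # group 1: the label; greedy, so it runs up to the last colon
    :           # the colon separating label and value
    \s*         # optional whitespace
    (\S*)       # group 2: the value, which may be empty
    \s*         # optional whitespace
    \(?         # optional opening parenthesis
    (\S*)       # group 3: the change; greedy, so it also takes the closing parenthesis
    \)?         # optional closing parenthesis; matches nothing once group 3 has consumed it
''', re.VERBOSE)

groups = pattern.match('Orders: 515 (-53%)').groups()
print(groups)
```

Group 3 keeping the trailing parenthesis is a consequence of \S* being greedy.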

Here is the code that builds a dict from the lines_stripped list we created above:

import re
regex = r'([\S\s]*):\s*(\S*)\s*\(?(\S*)\)?'
matched_dict = {}
for line in lines_stripped:
    match = re.match(regex, line)
    matched_dict[match.groups()[0]] = (match.groups()[1], match.groups()[2])

print(matched_dict)

This produces the following output:

{'AOV': ('', '12%)'),
 'AUR': ('', '30%)'),
 'Average Discount': ('21%', ''),
 'Demand': ('$41,225', '-47%)'),
 'Orders': ('515', '-53%)'),
 'Orders with Promo Code': ('3%', ''),
 'Units': ('849', '-59%)')}

You asked about Units, Orders and Demand, so here is the extraction:

# Remove the dollar sign before converting to int
# Replace the thousands separator ',' with an empty string
demand_string = matched_dict['Demand'][0].strip('$').replace(',', '')
print(int(demand_string))
print(int(matched_dict['Orders'][0]))
print(int(matched_dict['Units'][0]))

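As you can see, Demand is a bit more complicated because it contains extra characters that Python cannot parse when converting to int.

Here is the final output of these 3 prints: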

41225
515
849
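Since the question asks for one column per metric, here is a rough sketch of how the same idea could be applied to every email body at once. The helper parse_metrics and the sample bodies are hypothetical, and the pattern assumes each metric appears as "Name: value" with an optional dollar sign and thousands separators.

```python
import re

def parse_metrics(body):
    """Return Demand, Orders and Units from one email body as ints (None if missing)."""
    row = {}
    for name in ('Demand', 'Orders', 'Units'):
        # Label, colon, optional '$', then a run of digits and commas
        match = re.search(name + r':\s*\$?([\d,]+)', body)
        row[name] = int(match.group(1).replace(',', '')) if match else None
    return row

# Hypothetical sample bodies standing in for the email_content list
sample_bodies = [
    'Demand: $41,225 (-47%)\t \r\nOrders: 515 (-53%)\t \r\nUnits: 849 (-59%)\t',
    'Demand: $1,000 (5%)\t \r\nOrders: 20 (1%)\t \r\nUnits: 35 (2%)\t',
]

# One dict per email, keyed by metric name
rows = [parse_metrics(body) for body in sample_bodies]
print(rows)
```

Passing rows to pd.DataFrame(rows) should then give one column per metric, and email_date could be added as a further column.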

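I hope this answers your question! If you have more questions about regex, I encourage you to experiment with regexr; it is very well built.

Edit: There appears to be a small issue in the regex that causes the last ")" to be included in the last group. This does not affect your problem, though!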
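If the trailing parenthesis mentioned in the edit gets in the way, one possible variant (an assumption on my part, not the pattern used above) stops the value before "(" and excludes ")" from the last group:

```python
import re

# Assumed variant: the value stops at whitespace or '(', and the change excludes ')'
fixed = re.compile(r'([^:]+):\s*([^\s(]*)\s*(?:\(([^)]*)\))?')

sample_lines = ['Demand: $41,225 (-47%)', 'Orders: 515 (-53%)', 'Average Discount: 21%']

# The last group is None when a line has no parenthesised change
results = [fixed.match(line).groups() for line in sample_lines]
for groups in results:
    print(groups)
```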
