How to scrape links from a multipart email in Python
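I have a program that logs into a specified Gmail account and fetches every email in a selected inbox that was sent from an address you enter at runtime.
I want to get all the links from each email and append them to a list, so that I can filter out the ones I don't need before writing them to another file. I'm using a regex for this, which means I have to convert the payload to a string. The problem is that my regex doesn't work with findall(); it only works when I use search() (I'm not very familiar with regex). Is there a better way to extract all the links from an email that doesn't require me to mess around with regex?
My code currently looks like this: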
    print(f'[{Mail.timestamp}] Scanning inbox')
    sys.stdout.write(Style.RESET)
    self.search_mail_status, self.amount_matching_criteria = self.login_session.search(Mail.CHARSET,search_criteria)
    if self.amount_matching_criteria == 0 or self.amount_matching_criteria == '0':
        print(f'[{Mail.timestamp}] No mails from that email address could be found...')
        Mail.enter_to_continue()
        import main
        main.main_wrapper()
    else:
        # Raw string, so that \s reaches the regex engine unchanged
        pattern = r'(?P<url>https?://[^\s]+)'
        prog = re.compile(pattern)
        self.amount_matching_criteria = self.amount_matching_criteria[0]
        self.amount_matching_criteria_str = str(self.amount_matching_criteria)
        num_mails = re.search(r"\d.+",self.amount_matching_criteria_str)
        num_mails = ((num_mails.group())[:-1]).split(' ')
        sys.stdout.write(Style.GREEN)
        print(f'[{Mail.timestamp}] Status code of {self.search_mail_status}')
        sys.stdout.write(Style.RESET)
        sys.stdout.write(Style.YELLOW)
        print(f'[{Mail.timestamp}] Found {len(num_mails)} emails')
        sys.stdout.write(Style.RESET)
        num_mails = self.amount_matching_criteria.split()
        for message_num in num_mails:
            individual_response_code, individual_response_data = self.login_session.fetch(message_num, '(RFC822)')
            message = email.message_from_bytes(individual_response_data[0][1])
            if message.is_multipart():
                print('multipart')
                multipart_payload = message.get_payload()
                for sub_message in multipart_payload:
                    string_payload = str(sub_message.get_payload())
                    # search() returns only the first match, or None if the part has no link
                    print(prog.search(string_payload).group("url"))
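For reference, here is a minimal sketch of how search() and findall() differ with the pattern above. The sample text is made up for illustration. With a pattern that has exactly one group, findall() returns the text of that group for every match, so it should return all the URLs in a string, while search() stops at the first one and returns None when there is no match at all.

```python
import re

# Same pattern as in the question, written as a raw string
prog = re.compile(r'(?P<url>https?://[^\s]+)')

# Made-up sample text, not taken from a real email
text = 'See https://example.com/a and http://example.org/b?x=1 for details.'

# search() finds only the first match (and returns None if nothing matches)
first = prog.search(text).group('url')

# findall() returns every match; with a single group it returns that group's text
all_links = prog.findall(text)

print(first)      # https://example.com/a
print(all_links)  # ['https://example.com/a', 'http://example.org/b?x=1']
```

If findall() seems to return nothing useful on a real message, the input string is a likely suspect rather than the pattern: get_payload() without decode=True returns the part still in its transfer encoding, so a quoted-printable body can have URLs cut by soft line breaks, and a nested multipart part returns a list instead of text.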
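In the end I used this for loop, together with a recursive function and a regex, to get the links. Before adding a link to the set, I drop every link that does not contain a substring you can enter in the program.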
    for message_num in self.amount_matching_criteria.split():
        counter += 1
        _, self.individual_response_data = self.login_session.fetch(message_num, '(RFC822)')
        self.raw = email.message_from_bytes(self.individual_response_data[0][1])
        raw = self.raw
        self.scraped_email_value = email.message_from_bytes(Mail.scrape_email(raw))
        self.scraped_email_value = str(self.scraped_email_value)
        self.returned_links = prog.findall(self.scraped_email_value)
        for i in self.returned_links:
            if self.substring_filter in i:
                self.link_set.add(i)
        self.timestamp = time.strftime('%H:%M:%S')
        print(f'[{self.timestamp}] Links scraped: [{counter}/{len(num_mails)}]')
The function used:
    def scrape_email(raw):
        if raw.is_multipart():
            return Mail.scrape_email(raw.get_payload(0))
        else:
            return raw.get_payload(None,True)
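Note that scrape_email only follows the first sub-part of each multipart container, so links that appear only in a later part (for example the HTML alternative) are not seen. As a possible alternative, here is a sketch that visits every text part with walk() and, for HTML parts, reads the href attributes with the standard library's html.parser instead of a regex. The names extract_links, HrefCollector and URL_RE are my own and not part of the original program, and the sketch assumes each text part declares a charset that Python knows.

```python
import email
import re
from email import policy
from html.parser import HTMLParser

# Fallback pattern for plain-text parts (hypothetical name, not from the original code)
URL_RE = re.compile(r'https?://[^\s<>"\']+')


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


def extract_links(raw_bytes):
    """Return the set of links found in all text parts of a raw RFC 822 message."""
    # policy.default gives EmailMessage objects, whose get_content() decodes
    # the transfer encoding and the charset in one step
    message = email.message_from_bytes(raw_bytes, policy=policy.default)
    links = set()
    for part in message.walk():
        # Skip multipart containers, images and other non-text parts
        if part.get_content_maintype() != 'text':
            continue
        text = part.get_content()
        if part.get_content_subtype() == 'html':
            collector = HrefCollector()
            collector.feed(text)
            links.update(collector.links)
        else:
            links.update(URL_RE.findall(text))
    return links
```

In the loop above, this would take the raw bytes from the fetch response, i.e. self.individual_response_data[0][1], and the substring filter could then be applied to the returned set before adding to self.link_set. Text attachments are scanned too, which may or may not be what you want.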