Using IMAP to get URLs in an email not working correctly
I'm trying to find specific URLs in an email: I want to collect every URL that contains a given string. Here is my code:
    import imaplib
    import regex as re

    def find_urls(string):
        regex = r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"
        url = re.findall(regex,string)
        return([x[0] for x in url])

    def save_matching_urls(username, password, sender, url_string):
        print("connecting to email, please wait...")
        con = imaplib.IMAP4_SSL("imap.gmail.com")
        con.login(username, password)
        con.select('INBOX')
        print("connected successfully, scraping email from " + sender)
        (_, data) = con.search(None, '(FROM {0})'.format(sender.strip()))
        ids = data[0].split()
        print(str(len(ids)) +" emails found")
        list_urls = []
        list_good_urls = []
        for mail in ids:
            result, data = con.fetch(mail, '(RFC822)') # fetch the email headers and body (RFC822) for the given ID
            raw_email = data[0][1]
            email = raw_email.decode("utf-8").replace("\r", '').replace("\t", '').replace(" ", "").replace("\n", "")
            list_url = find_urls(email)
            for url in list_url:
                if url_string in url:
                    list_good_urls.append(url)
        print(str(len(list_good_urls)) + " urls found, saving...")
        with open("{}_urls.txt".format(sender), mode="a", encoding="utf-8") as file:
            for url in list_good_urls:
                file.write(url + '\n')
        print("urls saved !")
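The first function finds the URLs in a given string. The other function connects to the mailbox over IMAP, then tries to find and save the matching URLs from a specific sender.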
To illustrate the problem, I used the site http://ismyemailworking.com/, which sends you an email containing two URLs that include the string "email". They are:
http://ismyemailworking.com/Block.aspx
http://ismyemailworking.com/Contact.aspx
The URL saved by the code (it actually finds only one URL):
IsMyEmailWorking.com/Block.aspx=20to=20temporarily=20block==20your=20email=20address=20for=201=20hour.=20This=20solves=20the=20problem==2099%=20of=20the=20time.=20If=20after=20this=20you=20continue=20to=20have==20problems=20please=20contact=20us=20via=20the=20contact=20link=20on=20our==20website=20at=20http://IsMyEmailWorking.com/Contact.aspx
I don't know which part of the code causes this problem. Any help would be greatly appreciated!
A variant:
    from imap_tools import MailBox, A
    from magic import find_urls

    with MailBox('imap.mail.com').login('test@mail.com', 'pwd', 'INBOX') as mailbox:
        for msg in mailbox.fetch(A(all=True)):
            body = msg.text or msg.html
            urls = find_urls(body)
*Regards, the author of imap_tools
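The `=20` sequences in the saved output are quoted-printable escapes for spaces: the raw RFC822 bytes are still transfer-encoded, and stripping all whitespace before matching lets the regex run across whole sentences. For anyone who prefers to stay with the standard library, here is a minimal sketch of decoding the body before searching it. The names `extract_urls` and `URL_RE`, the simplified URL pattern, and the sample message are all illustrative assumptions, not part of the original post; the sketch also assumes each text part declares a charset Python can decode.

```python
import email
import re
from email import policy

# Simplified, hypothetical URL pattern (not the regex from the question).
URL_RE = re.compile(r"https?://[^\s<>\"']+")

def extract_urls(raw_bytes, needle):
    """Return URLs containing `needle` found in the decoded text parts of a raw message."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    urls = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        # get_content() undoes quoted-printable/base64 and decodes the charset.
        text = part.get_content()
        for url in URL_RE.findall(text):
            url = url.rstrip(".,;:!?)")  # drop trailing sentence punctuation
            if needle.lower() in url.lower():
                urls.append(url)
    return urls

# Made-up sample message imitating the quoted-printable body from the question.
sample = (
    b"From: test@example.com\r\n"
    b"Subject: demo\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"Go to http://IsMyEmailWorking.com/Block.aspx=20to=20temporarily=20block=\r\n"
    b"=20your=20email. Contact us at http://IsMyEmailWorking.com/Contact.aspx\r\n"
)
print(extract_urls(sample, "email"))
```

In the question's loop, `data[0][1]` could be passed as `raw_bytes` in place of the `decode(...).replace(...)` chain, so that whitespace is kept and still marks where each URL ends.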