Scrape emails with python 3.x from websites

I have a script that is supposed to take a list of websites and search them for email addresses (see the code below). Every time it hits an error such as "website forbidden" or "service temporarily unavailable", the script starts over from the beginning.

# -*- coding: utf-8 -*-

import urllib.request, urllib.error
import re
import csv
import pandas as pd
import os
import ssl

# 1: Get input file path from user '.../Documents/upw/websites.csv'
user_input = input("Enter the path of your file: ")

# If input file doesn't exist
if not os.path.exists(user_input):
    print("File not found, verify the location - ", str(user_input))


def sites(e):
    pass


while True:
    try:
        # 2. read file
        df = pd.read_csv(user_input)

        # 3. create the output csv file
        with open('Emails.csv', mode='w', newline='') as file:
            csv_writer = csv.writer(file, delimiter=',')
            csv_writer.writerow(['Website', 'Email'])

        # 4. Get websites
        for site in list(df['Website']):
            # print(site)
            gcontext = ssl.SSLContext(ssl.PROTOCOL_TLSv1_2)
            req = urllib.request.Request("http://" + site, headers={
                'User-Agent': "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0.1",
                # 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0.1',
                'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
                'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
                'Accept-Encoding': 'none',
                'Accept-Language': 'en-US,en;q=0.8',
                'Connection': 'keep-alive'
            })

            # 5. Scrape email id
            with urllib.request.urlopen(req, context=gcontext) as url:
                s = url.read().decode('utf-8', 'ignore')
                email = re.findall(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}", s)
                print(email)

                # 6. Write the output
                with open('Emails.csv', mode='a', newline='') as file:
                    csv_writer = csv.writer(file, delimiter=',')
                    [csv_writer.writerow([site, item]) for item in email]

    except urllib.error.URLError as e:
        print("Failed to open URL {0} Reason: {1}".format(site, e.reason))

If I remove this code:

def sites(e):
    pass

while True:

then the script stops as soon as an error occurs.

What it should do instead is keep searching when an error comes back from the web side, rather than stopping the script.

I have been searching online for a while and have read several posts, but apparently in the wrong places, because I haven't found a solution yet.

Any help would be greatly appreciated.

The problem is your while True: loop. It will always restart: when an exception is raised in the try block, control jumps to the except block, and then the loop comes around again and runs the try block from the beginning.

When you take the while True: out, an exception simply ends the run: the exception raised in the try block stops its execution, the except block runs, and execution then continues past the try/except to the rest of the program, which here is the end of the script.
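In miniature, the difference looks like this (fetch() is a hypothetical stand-in for your urlopen call):

def fetch(site):
    # hypothetical stand-in for urllib.request.urlopen
    if site == "bad.example":
        raise ValueError("service temporarily unavailable")
    return site

sites_list = ["a.example", "bad.example", "c.example"]

# try/except wrapped around the whole loop: the first failure aborts
# the loop, so "c.example" is never visited.
try:
    for s in sites_list:
        print("fetched", fetch(s))
except ValueError as e:
    print("stopped at", s, "-", e)

# Wrap that try/except in `while True:` and, instead of stopping here,
# the script would restart from "a.example" after every failure.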

What you want is the try block inside the loop that iterates over the sites in df['Website'], so that when an exception is thrown the script moves on to the next site in the list, instead of re-reading the dataframe and starting the loop over all the sites from scratch:

# 2. read file
df = pd.read_csv(user_input)

# 3. create the output csv file
with open('Emails.csv', mode='w', newline='') as file:
    csv_writer = csv.writer(file, delimiter=',')
    csv_writer.writerow(['Website', 'Email'])

# 4. Get websites
for site in list(df['Website']):
    try:
        # print(site)
        gcontext = ssl.SSLContext(ssl.PROTOCOL_TLSv1_2)
        req = urllib.request.Request("http://" + site, headers={
            'User-Agent': "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0.1",
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
            'Accept-Encoding': 'none',
            'Accept-Language': 'en-US,en;q=0.8',
            'Connection': 'keep-alive'
        })

        # 5. Scrape email id
        with urllib.request.urlopen(req, context=gcontext) as url:
            s = url.read().decode('utf-8', 'ignore')
            email = re.findall(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}", s)
            print(email)

            # 6. Write the output
            with open('Emails.csv', mode='a', newline='') as file:
                csv_writer = csv.writer(file, delimiter=',')
                for item in email:
                    csv_writer.writerow([site, item])

    except urllib.error.URLError as e:
        print("Failed to open URL {0} Reason: {1}".format(site, e.reason))