Python web scraping [Error 10060]

Question

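I am trying to get my code, which scrapes HTML table information from the web, to work through a list of websites stored in the file ShipURL.txt. The code reads the web page addresses from ShipURL, then goes to each link, downloads the table data, and saves it to a csv. My problem is that the program cannot finish, because the error "A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond" occurs partway through and the program stops. As I understand it, I now need to increase the request time, use a proxy, or add a try statement. I have looked through some answers about the same problem, but as a newbie I found them hard to understand. Any help would be appreciated.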

ShipURL.txt https://dl.dropboxusercontent.com/u/110612863/ShipURL.txt

# -*- coding: utf-8 -*-
fm = open('ShipURL.txt', 'r')
Shiplinks = fm.readlines()

import csv
from urllib import urlopen
from bs4 import BeautifulSoup
import re
for line in Shiplinks:
    website = re.findall(r'(https?://\S+)', line)
    website = "".join(str(x) for x in website)
    if website == "":
        continue                                                #Skips lines that do not contain a URL

    with open('ShipData.csv','wb')as f:                         #Creates an empty csv file to which assign values.
        writer = csv.writer(f)
        shipUrl = website
        shipPage = urlopen(shipUrl)

        soup = BeautifulSoup(shipPage, "html.parser")           #Read the web page HTML
        table = soup.find_all("table", { "class" : "table1" })  #Finds table with class table1
        List = []
        columnRow = ""
        valueRow = ""
        Values = []
        for mytable in table:                                   #Loops tables with class table1
            table_body = mytable.find('tbody')                  #Finds tbody section in table
            try:                                                #If tbody exists
                rows = table_body.find_all('tr')                #Finds all rows
                for tr in rows:                                 #Loops rows
                    cols = tr.find_all('td')                    #Finds the columns
                    i = 1                                       #Variable to control the lines
                    for td in cols:                             #Loops the columns
    ##                    print td.text                           #Displays the output
                        co = td.text                            #Saves the column to a variable
    ##                    writer.writerow([co])                 Writes the variable in CSV file row
                        if i == 1:                              #Checks the control variable, if it equals to 1

                            if td.text[ -1] == ":":
                                # strips the colon and appends a comma
                                columnRow += td.text.strip(":") + "," # Thought it might be simpler to put everything straight into one string
                                List.append(td.text)                #.. takes the column value and assigns it to a list called 'List' and..
                                i+=1                                #..Increments i by one

                        else:
                            # strips the line breaks and appends a comma to the string
                            valueRow += td.text.strip("\n") + ","
                            Values.append(td.text)              #Takes the second columns value and assigns it to a list called Values
                        #print List                             #Checking stuff
                        #print Values                           #Checking stuff


            except:
                print "no tbody"
        # Also print the headers and values, separated by a line break :)
        print columnRow.strip(",")
        print "\n"
        print valueRow.strip(",")
        # encoding started acting up again
        # Writes the column headers as the first row and the values as the second
        writer.writerow([columnRow.encode('utf-8')])
        writer.writerow([valueRow.encode('utf-8')])

Answer 1

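Websites protect themselves against DDoS attacks by blocking consecutive accesses from a single IP.

You should put a sleep between each access, or sleep after every 10, 20, or 50 accesses.

Or you may have to access the site anonymously through the Tor network or some other alternative.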
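
The sleep suggestion above could be sketched roughly as follows. This is only an illustration: the helper name fetch_politely, the 2-second default, and the injectable fetch and sleep arguments are assumptions made for the example, not part of the original script, and the right delay depends on the site.

```python
import time


def fetch_politely(urls, fetch, delay=2.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing `delay` seconds between requests.

    `fetch` is any callable that downloads one page (for example a small
    wrapper around urlopen). `sleep` is a parameter only so the pause can
    be replaced in tests; both names are hypothetical.
    """
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # wait before every request except the first
        pages.append(fetch(url))
    return pages
```

In the question's script, the same effect could be had by calling time.sleep(2) at the end of each pass through the `for line in Shiplinks:` loop.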

Answer 2

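I would wrap your urlopen call in a try/except. Like this:

try:
  shipPage = urlopen(shipUrl)
except IOError as e:   # urllib's urlopen raises IOError when the connection cannot be made
  print e

This will at least help you figure out where the error is happening. Without the extra files, it is hard to troubleshoot otherwise.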
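
To also see which URL failed and let the loop move on to the next ship, the call could be wrapped in a small helper along these lines. This is a sketch under assumptions: it is written for Python 3 (where urlopen lives in urllib.request, unlike the Python 2 code in the question), and the name open_or_report, the opener parameter, and the 10-second timeout are made up for illustration.

```python
from urllib.request import urlopen


def open_or_report(url, opener=urlopen, timeout=10):
    """Return the response for `url`, or None if the connection fails.

    `opener` defaults to urlopen and is a parameter only so the helper can
    be exercised without touching the network (a hypothetical design choice).
    """
    try:
        return opener(url, timeout=timeout)
    except OSError as e:
        # In Python 3, urllib.error.URLError and socket timeouts are both
        # subclasses of OSError, so this covers refused and timed-out connections.
        print("failed to open %s: %s" % (url, e))
        return None
```

The caller would then check for None and skip that URL, for example with `if shipPage is None: continue` inside the loop.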

Python errors documentation

Answer 3

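Found some great information at this link: How to retry after exception in python? This was basically a connection problem on my side, so I decided to keep trying until it succeeds. At the moment it is working. This code solved the problem: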

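import urllib2

while True:
    try:
        shipPage = urllib2.urlopen(shipUrl, timeout=5)  # give up on this attempt after 5 seconds
    except Exception as e:
        continue                                        # the attempt failed, so try again
    break                                               # success: leave the retry loop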
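
One caveat with the loop above is that it never gives up, so a permanently dead link would keep the program spinning forever. A bounded variant with a growing pause between attempts might look like the sketch below. It is written for Python 3 and is only an illustration: open_with_retries, attempts, base_delay, and the injectable opener and sleep arguments are hypothetical names, and the numbers are arbitrary defaults.

```python
import time


def open_with_retries(url, opener, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Try opener(url) up to `attempts` times, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return opener(url)
        except OSError:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the last error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ... with the defaults
```

Here opener would be something like `lambda u: urlopen(u, timeout=5)`. Only OSError is caught, on the assumption that connection failures are the errors worth retrying, so programming mistakes still surface immediately.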

But I really want to thank everyone here, you helped me understand the problem much better!

python

runtime-error

web-scraping