HTML Link parsing using BeautifulSoup

Question

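This is my Python code, which I use to extract the links from the page that I pass in as a parameter. I am using BeautifulSoup. The code sometimes works fine and sometimes gets stuck!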

import urllib
from bs4 import BeautifulSoup

rawHtml = ''
url = r'http://iasexamportal.com/civilservices/tag/voice-notes?page='
for i in range(1, 49):  
    #iterate url and capture content
    sock = urllib.urlopen(url+ str(i))
    html = sock.read()  
    sock.close()
    rawHtml += html
    print i

Here I print the loop variable to find out where it gets stuck. It shows that the script gets stuck at a random iteration of the loop.

soup = BeautifulSoup(rawHtml, 'html.parser')
t=''
for link in soup.find_all('a'):
    t += str(link.get('href')) + "</br>"
    #t += str(link) + "</br>"
f = open("Link.txt", 'w+')
f.write(t)
f.close()

What could be the problem? Is it a socket configuration issue or something else?
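
A loop that stalls at a random iteration is often waiting on a network call that never returns. The sketch below is one hedged way to guard against that: it puts a timeout on each request and retries a few times before giving up. It is written for Python 3 (`urllib.request`), not the Python 2.7 used in the question, and the name `fetch_with_retry`, its parameters, and the default values are assumptions made for illustration, not part of the original script.

```python
import time
import urllib.request


def fetch_with_retry(url, retries=3, timeout=10, delay=1.0,
                     opener=urllib.request.urlopen):
    """Return the response body as bytes, or None if every attempt fails.

    Hypothetical helper (Python 3). `opener` is injectable so the function
    can be exercised without touching the network.
    """
    for attempt in range(1, retries + 1):
        try:
            # The timeout makes a stalled connection raise an exception
            # instead of blocking the loop indefinitely.
            response = opener(url, timeout=timeout)
            try:
                return response.read()
            finally:
                response.close()
        except OSError as exc:
            # URLError, socket.timeout and socket.gaierror are all
            # subclasses of OSError in Python 3.
            print("attempt %d/%d for %s failed: %s"
                  % (attempt, retries, url, exc))
            if attempt < retries:
                time.sleep(delay)
    return None
```

In the loop from the question, each `urllib.urlopen(url + str(i))` call could be replaced by `fetch_with_retry(url + str(i))`, skipping the page when the result is `None`. The printed messages then show which page failed and why, rather than leaving the script hanging.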

This is the error I get. I checked these links for a solution: python-gaierror-errno-11004 and ioerror-errno-socket-error-errno-11004-getaddrinfo-failed. But I did not find them very helpful.

 d:\python>python ext.py
Traceback (most recent call last):
  File "ext.py", line 8, in <module>
    sock = urllib.urlopen(url+ str(i))
  File "d:\python\lib\urllib.py", line 87, in urlopen
    return opener.open(url)
  File "d:\python\lib\urllib.py", line 213, in open
    return getattr(self, name)(url)
  File "d:\python\lib\urllib.py", line 350, in open_http
    h.endheaders(data)
  File "d:\python\lib\httplib.py", line 1049, in endheaders
    self._send_output(message_body)
  File "d:\python\lib\httplib.py", line 893, in _send_output
    self.send(msg)
  File "d:\python\lib\httplib.py", line 855, in send
    self.connect()
  File "d:\python\lib\httplib.py", line 832, in connect
    self.timeout, self.source_address)
  File "d:\python\lib\socket.py", line 557, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
IOError: [Errno socket error] [Errno 11004] getaddrinfo failed

When I run it on my personal laptop, it runs perfectly. But when I run it on my office desktop, I get the error. Also, my Python version is 2.7. I hope this information helps.
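
The traceback ends in `getaddrinfo failed`, which means the host name could not be resolved on that machine before any HTTP request was sent. A quick way to confirm this separately from the scraping loop is to run the lookup on its own. The following is a minimal sketch in Python 3; `can_resolve` and its `resolver` parameter are hypothetical names introduced here for illustration.

```python
import socket
from urllib.parse import urlsplit


def can_resolve(url, resolver=socket.getaddrinfo):
    """Return True if the host name in `url` resolves to at least one address.

    Hypothetical diagnostic helper; `resolver` is injectable for testing.
    """
    host = urlsplit(url).hostname
    if host is None:
        # No host part could be parsed from the string.
        return False
    try:
        # Same kind of lookup that fails at the bottom of the traceback.
        return len(resolver(host, 80, 0, socket.SOCK_STREAM)) > 0
    except socket.gaierror as exc:
        print("cannot resolve %s: %s" % (host, exc))
        return False
```

If `can_resolve(url)` returns `False` on the office desktop but `True` on the laptop, the difference most likely lies in the network environment (DNS, firewall, or proxy) and not in the script. Note that on a network where all web traffic goes through a proxy, direct lookups can fail even though a browser reaches the site, because the proxy resolves the name on the client's behalf.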

Answer 1

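Finally, guys... it worked! The same script works when I run it on another PC. So the problem was probably the firewall or proxy settings on my office desktop, which were blocking this website.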
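
If the office network does require an HTTP proxy, the script can be pointed at it explicitly. The sketch below assumes Python 3 and a proxy address that is purely hypothetical (`http://proxy.example.com:8080`); the real address, and whether a proxy is the cause at all, would have to be confirmed with whoever administers the network. `make_proxy_opener` is a name introduced here for illustration.

```python
import urllib.request


def make_proxy_opener(proxy_url):
    """Build an opener that sends HTTP and HTTPS requests through `proxy_url`.

    Hypothetical helper; the proxy address must come from the local
    network configuration.
    """
    proxies = {"http": proxy_url, "https": proxy_url}
    # Passing a ProxyHandler instance replaces the default one that
    # build_opener would otherwise add.
    return urllib.request.build_opener(urllib.request.ProxyHandler(proxies))
```

An opener built this way would be used as `make_proxy_opener("http://proxy.example.com:8080").open(url, timeout=10).read()`. Calling `urllib.request.getproxies()` shows which proxy settings Python has detected from the system, which can help compare the two machines. On Python 2.7, `urllib2` provides `ProxyHandler` and `build_opener` under the same names.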

python

url

beautifulsoup

filereader

filewriter