How to access webpages using Python via a proxy
I am writing a small program that fetches all the hyperlinks from a webpage given its URL, but the network I am on seems to be using a proxy and it is unable to fetch ..

My code:
import sys
import urllib
import urlparse
from bs4 import BeautifulSoup

def process(url):
    page = urllib.urlopen(url)
    text = page.read()
    page.close()
    soup = BeautifulSoup(text)
    with open('s.txt', 'w') as file:
        for tag in soup.findAll('a', href=True):
            tag['href'] = urlparse.urljoin(url, tag['href'])
            print tag['href']
            file.write('\n')
            file.write(tag['href'])

def main():
    if len(sys.argv) == 1:
        print 'No url !!'
        sys.exit(1)
    for url in sys.argv[1:]:
        process(url)
The urllib library you are using for HTTP access does not support proxy authentication (it does support unauthenticated proxies). From the docs:
Proxies which require authentication for use are not currently
supported; this is considered an implementation limitation.
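So for an unauthenticated proxy you can stay with urllib and pass a proxies mapping directly to urlopen; a minimal sketch (the proxy host and port below are placeholders):

import urllib

# urllib.urlopen(url[, data[, proxies]]) accepts a proxies mapping
# for unauthenticated proxies.
page = urllib.urlopen('http://www.example.com',
                      proxies={'http': 'http://proxyhost:8080'})
text = page.read()
page.close()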
I suggest you switch to urllib2 and use it as described in the answer to this post.
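Presumably that answer describes urllib2's ProxyHandler; a minimal sketch under that assumption (the proxy host, port, and credentials are placeholders):

import urllib2

# Build an opener that routes HTTP traffic through an authenticated proxy.
proxy = urllib2.ProxyHandler({'http': 'http://user:pass@proxyhost:8080'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)  # every urllib2.urlopen call now uses the proxy

page = urllib2.urlopen('http://www.example.com')
text = page.read()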
You could use the requests module instead.
import requests
proxies = { 'http': 'http://host/' }
# or if it requires authentication 'http://user:pass@host/' instead
r = requests.get(url, proxies=proxies)
text = r.text
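For completeness, a sketch of how the original link-extraction loop could sit on top of requests; the proxy URL and target URL are placeholders:

import requests
import urlparse
from bs4 import BeautifulSoup

proxies = {'http': 'http://user:pass@proxyhost:8080'}
r = requests.get('http://www.example.com', proxies=proxies)
soup = BeautifulSoup(r.text)
for tag in soup.findAll('a', href=True):
    # Resolve relative links against the final URL, as in the original script.
    print urlparse.urljoin(r.url, tag['href'])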