Scraping web-page data with urllib with headers and a proxy

I can already fetch the web-page data, but now I want to fetch it through a proxy. How can I do that?

import urllib.request
import lxml.html as lh  # lh is presumably lxml.html

def get_main_html():
    # URL and headers are assumed to be defined elsewhere
    request = urllib.request.Request(URL, headers=headers)
    doc = lh.parse(urllib.request.urlopen(request))
    return doc

From the documentation:

urllib will auto-detect your proxy settings and use those. This is through the ProxyHandler, which is part of the normal handler chain when a proxy setting is detected. Normally that’s a good thing, but there are occasions when it may not be helpful. One way to do this is to setup our own ProxyHandler, with no proxies defined. This is done using similar steps to setting up a Basic Authentication handler.

See https://docs.python.org/3/howto/urllib2.html#proxies
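
The howto's approach, installing a ProxyHandler with an empty dict so that no proxy is used at all, looks like this in urllib.request:

import urllib.request

# An empty mapping means "no proxies", overriding any auto-detected settings
proxy_support = urllib.request.ProxyHandler({})
opener = urllib.request.build_opener(proxy_support)
urllib.request.install_opener(opener)  # urlopen now uses this opener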

Usage (note this snippet is Python 2; urllib.urlopen and its proxies argument are gone in Python 3):

proxies = {'http': 'http://myproxy.example.com:1234'}
print "Using HTTP proxy %s" % proxies['http']
urllib.urlopen("http://yoursite", proxies=proxies)
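
In Python 3 the equivalent is a ProxyHandler-based opener. A minimal sketch, reusing the same proxy address:

import urllib.request

proxies = {'http': 'http://myproxy.example.com:1234'}
print("Using HTTP proxy %s" % proxies['http'])
# Build an opener that sends HTTP requests through the given proxy
opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxies))
print(opener.open('http://yoursite').read())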

You can use SocksiPy (the example below is Python 2: note urllib2 and the print statements):

import ftplib 
import telnetlib 
import urllib2
import socks
#Set the proxy information
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, 'localhost', 9050)
#Route an FTP session through the SOCKS proxy
socks.wrapmodule(ftplib)
ftp = ftplib.FTP('cdimage.ubuntu.com') 
ftp.login('anonymous', 'support@aol.com') 
print ftp.dir('cdimage')
ftp.close()
#Route a telnet connection through the SOCKS proxy
socks.wrapmodule(telnetlib) 
tn = telnetlib.Telnet('achaea.com') 
print tn.read_very_eager()
tn.close()
#Route an HTTP request through the SOCKS proxy
socks.wrapmodule(urllib2) 
print urllib2.urlopen('http://www.whatismyip.com/automation/n09230945.asp').read()
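
On Python 3 urllib2 no longer exists. With PySocks (the maintained SocksiPy fork) the usual recipe is to replace the socket factory, which routes every new connection through the proxy. A sketch, assuming a SOCKS5 proxy on localhost:9050:

import socket
import socks
import urllib.request

# set_default_proxy/SOCKS5 are PySocks' names for setdefaultproxy/PROXY_TYPE_SOCKS5
socks.set_default_proxy(socks.SOCKS5, 'localhost', 9050)
socket.socket = socks.socksocket  # all new sockets now go through the proxy

print(urllib.request.urlopen('http://httpbin.org/ip').read())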

In your case:

import socks
import urllib.request
import lxml.html as lh  # lh is presumably lxml.html

# Set the proxy information
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, 'localhost', 9050)
# Wrap the module that actually opens the connection; in Python 3
# that is urllib.request, not the bare urllib package
socks.wrapmodule(urllib.request)

def get_main_html():
    # URL and headers are assumed to be defined elsewhere
    request = urllib.request.Request(URL, headers=headers)
    doc = lh.parse(urllib.request.urlopen(request))
    return doc
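
For completeness, a usage sketch; the URL and headers below are placeholders, not values from the question:

URL = 'http://example.com'                # hypothetical target
headers = {'User-Agent': 'Mozilla/5.0'}   # hypothetical headers

doc = get_main_html()
print(doc.getroot().find('.//title').text)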