Proper way to fix a url without http://

Question

I'm trying to open a list of urls in this format with urllib2:

google.com
facebook.com
youtube.com
yahoo.com
baidu.com

using this method:

urllib2.urlopen(url)

and I get this error:

File "fetcher.py", line 98, in fetch_urls_and_save
  response = urllib2.urlopen(url)
File "urllib2.py", line 154, in urlopen
  return opener.open(url, data, timeout)
File "urllib2.py", line 423, in open
  protocol = req.get_type()
File "urllib2.py", line 285, in get_type
  raise ValueError, "unknown url type: %s" % self.__original

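So, my question is:

Is there a proper way to 'fix' these urls, or should I simply prepend http:// to each string? I don't think that's the best solution, because what about urls that start with https://?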

Answer 1

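I would suggest just prepending http:// to the string, since many websites that use the https:// scheme switch to it automatically by redirecting the request, and urlopen follows redirects by default.

You can check the status returned by urlopen with the getcode() function.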

import urllib2  # Python 2 only

response = urllib2.urlopen("http://google.com")
print response.getcode()  # prints 200
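
If some entries in the list might already carry a scheme, one option is to prepend http:// only when none is present. Below is a minimal sketch of that idea, not a definitive implementation. The helper name ensure_scheme and its default_scheme parameter are my own (hypothetical), and it only uses the re module, so it should behave the same on Python 2 and Python 3.

```python
import re

# Matches a scheme such as "http://" or "https://" at the very start of the string.
_SCHEME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")


def ensure_scheme(url, default_scheme="http"):
    """Return url unchanged if it already has a scheme, else prepend one.

    ensure_scheme is a hypothetical helper, not part of urllib2.
    """
    url = url.strip()
    if _SCHEME_RE.match(url):
        # Already has a scheme (http://, https://, ...): leave it alone.
        return url
    if url.startswith("//"):
        # Protocol-relative form such as "//example.com".
        return default_scheme + ":" + url
    return default_scheme + "://" + url


urls = ["google.com", "https://facebook.com", "//youtube.com", " yahoo.com "]
fixed = [ensure_scheme(u) for u in urls]
# fixed == ["http://google.com", "https://facebook.com",
#           "http://youtube.com", "http://yahoo.com"]
```

Each fixed string can then be passed to urllib2.urlopen as in the snippet above. Anchoring the check at the start of the string means a url like example.com/?next=http://foo still gets the prefix, which a plain "://" in url test would miss.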

python

url

urllib

urllib2