Python - ValueError: unknown url type
I am trying to extract the source from the <iframe> attributes, as follows:
iframes = [<iframe frameborder="no" height="160px" scrolling="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/308197184%3Fsecret_token%3Ds-VtArH&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false&visual=true" width="100%"></iframe>, <iframe allowtransparency="true" frameborder="0" scrolling="no" src="//www.facebook.com/plugins/likebox.php?href=https%3A%2F%2Fwww.facebook.com%2FPauseMusicale&width=300&height=62&show_faces=false&colorscheme=light&stream=false&show_border=false&header=false" style="border:none; overflow:hidden; width:300px; height:62px;"></iframe>, <iframe allowfullscreen="" frameborder="0" height="169" src="//www.youtube.com/embed/videoseries?list=PLNKCTdT9YSESoQnj5tPP4P9kaIwBCx7F1" width="100%"></iframe>]
But when I try to extract it:
for iframe in iframes:
    url = urllib2.urlopen(iframe.attrs['src'])
    print (url)
I get the following error:
    url = urllib2.urlopen(iframe.attrs['src'])
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 423, in open
    protocol = req.get_type()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 285, in get_type
    raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: //www.facebook.com/plugins/likebox.php?href=https%3A%2F%2Fwww.facebook.com%2FPauseMusicale&width=300&height=62&show_faces=false&colorscheme=light&stream=false&show_border=false&header=false
Why am I getting a URL with no http before the //www?
Is there some workaround for this?
Why am I getting a URL with no http before the //www?
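This is a common way of telling the user agent to use the same scheme (http, https, ftp, file, etc.) as the current page when it makes subsequent requests. So, for example, if you loaded the current page via https, then the URLs that omit the scheme will also be accessed via https.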
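To illustrate that rule, here is a minimal sketch using `urljoin` from Python 3's `urllib.parse` (in Python 2 the same function lives in the `urlparse` module). The two page URLs are hypothetical; they stand in for whatever page the iframes were scraped from.

```python
from urllib.parse import urljoin

# A scheme-relative reference, taken from the YouTube iframe in the question.
src = "//www.youtube.com/embed/videoseries?list=PLNKCTdT9YSESoQnj5tPP4P9kaIwBCx7F1"

# Hypothetical page URLs (assumptions, not from the question).
https_page = "https://example.com/some/page.html"
http_page = "http://example.com/some/page.html"

# urljoin resolves the reference against the base URL, so the
# missing scheme is inherited from the page that contains it.
resolved_https = urljoin(https_page, src)
resolved_http = urljoin(http_page, src)

print(resolved_https)
print(resolved_http)
```

If you know the URL of the page you scraped, resolving against it this way mirrors what a browser would do, rather than guessing a default scheme.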
Is there some workaround for this?
You can handle this in Python 2 (since that is your Python version) using the urlparse module:
# from urllib.parse import urlparse, urlunparse # Python 3
import urllib2
from urlparse import urlparse, urlunparse

for iframe in iframes:
    # split the src attribute into its six URL components
    scheme, netloc, path, params, query, fragment = urlparse(iframe.attrs['src'])
    if not scheme:
        scheme = 'http' # default scheme you used when getting the current page
    url = urlunparse((scheme, netloc, path, params, query, fragment))
    print('Fetching {}'.format(url))
    f = urllib2.urlopen(url)
    # print(f.read()) # dumps the response content
If you run the above code you should see this output:
Fetching https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/308197184%3Fsecret_token%3Ds-VtArH&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false&visual=true
Fetching http://www.facebook.com/plugins/likebox.php?href=https%3A%2F%2Fwww.facebook.com%2FPauseMusicale&width=300&height=62&show_faces=false&colorscheme=light&stream=false&show_border=false&header=false
Fetching http://www.youtube.com/embed/videoseries?list=PLNKCTdT9YSESoQnj5tPP4P9kaIwBCx7F1
This shows that the default scheme has been applied to the URLs that lacked one.
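For readers on Python 3, where `urllib2` no longer exists, the same idea can be sketched as a small helper. The function name `with_default_scheme` is hypothetical (it is not part of any library), and the sketch only builds the URL; it does not fetch anything.

```python
from urllib.parse import urlparse


def with_default_scheme(src, default_scheme="http"):
    """Return src unchanged if it has a scheme, else prepend default_scheme.

    Hypothetical helper for illustration; default_scheme should match
    the scheme used to load the page the iframe came from.
    """
    parts = urlparse(src)
    if not parts.scheme:
        # ParseResult is a namedtuple, so _replace returns a modified copy.
        parts = parts._replace(scheme=default_scheme)
    return parts.geturl()


sources = [
    "https://w.soundcloud.com/player/",
    "//www.facebook.com/plugins/likebox.php",
    "//www.youtube.com/embed/videoseries?list=PLNKCTdT9YSESoQnj5tPP4P9kaIwBCx7F1",
]

fixed = [with_default_scheme(s) for s in sources]
for url in fixed:
    print("Fetching {}".format(url))
```

The resulting strings can then be passed to `urllib.request.urlopen`, the Python 3 counterpart of `urllib2.urlopen`.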