Python urllib Unicode exception on 302 redirection
The situation is as follows:
I am scraping a website whose page URLs follow this pattern:
http://www.pageadress/somestuff/ID-HERE/
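Nothing unusual.
I have many IDs to scrape, and most of them work correctly.
However, the site behaves in a portal-like way. In a browser, when you enter such an address, you are redirected to: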
http://www.pageadress/somestuff/ID-HERE-title_of_subpage
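What may be problematic is that the title sometimes contains non-ASCII characters (in roughly 0.01% of cases), and so (I think this is the cause) I get this exception: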
  File "/usr/lib/python3.4/urllib/request.py", line 161, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.4/urllib/request.py", line 469, in open
    response = meth(req, response)
  File "/usr/lib/python3.4/urllib/request.py", line 579, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.4/urllib/request.py", line 501, in error
    result = self._call_chain(*args)
  File "/usr/lib/python3.4/urllib/request.py", line 441, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.4/urllib/request.py", line 684, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "/usr/lib/python3.4/urllib/request.py", line 463, in open
    response = self._open(req, data)
  File "/usr/lib/python3.4/urllib/request.py", line 481, in _open
    '_open', req)
  File "/usr/lib/python3.4/urllib/request.py", line 441, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.4/urllib/request.py", line 1210, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/usr/lib/python3.4/urllib/request.py", line 1182, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "/usr/lib/python3.4/http/client.py", line 1088, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib/python3.4/http/client.py", line 1116, in _send_request
    self.putrequest(method, url, **skips)
  File "/usr/lib/python3.4/http/client.py", line 973, in putrequest
    self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 38-39: ordinal not in range(128)
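The curious thing is that none of the Unicode characters in the URL I am redirected to actually sit at positions 38-39, although there are some at other positions.
The code being used: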
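One possible explanation for the mismatch, offered as a sketch rather than a confirmed diagnosis: the traceback shows the error being raised by `request.encode('ascii')` inside `putrequest`, where `request` is the whole request line (method, path, HTTP version) and not the full URL. The reported position is therefore an offset into that line. In addition, `http.client` decodes header bytes as ISO-8859-1, so a single two-byte UTF-8 character in the `Location` header can appear as two characters. The helper name `first_non_ascii` and the example path are made up for illustration.

```python
def first_non_ascii(method, selector):
    """Locate the first non-ASCII character the way the failing encode() sees it.

    Returns (offset in the request line, offset in the selector), or None
    when the request line is pure ASCII.
    """
    # Same shape as the line http.client builds before encoding it as ASCII.
    request_line = "%s %s HTTP/1.1" % (method, selector)
    try:
        request_line.encode("ascii")
    except UnicodeEncodeError as exc:
        # exc.start indexes the request line; drop the "GET " prefix
        # (method plus one space) to get the offset within the path.
        return exc.start, exc.start - len(method) - 1
    return None


if __name__ == "__main__":
    # Hypothetical redirect path, not the real one from the site.
    path = "/archive/tip/3207221/some-title-with-\u00e9"
    print(first_non_ascii("GET", path))
```

Comparing the second number against the path (without scheme and host) rather than against the full URL should show whether the positions line up.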
import socket
import urllib.parse
import urllib.request
socket.setdefaulttimeout(30)
url = "https://www.bettingexpert.com/archive/tip/3207221"
headers = {'User-Agent': 'Mozilla/5.0'}
content = urllib.request.urlopen(urllib.request.Request(url, None, headers)).read().decode('utf-8')
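Is there any way to get around this, preferably without using other libraries?
// Oh, the glorious world of Python, creating thousands of problems I would not even think were possible if I were writing in Ruby.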
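So, I found a solution to my specific problem.
I simply collected the remaining part of the URL from their API, and after some minor transformations I can access the page without any redirect.
Of course, that does not mean I solved the general problem; it may come back later, so I developed a 'solution'.
By posting this code here I basically guarantee that I will never be hired as a programmer, so don't look at it while eating.
The "Capybara" gem needs Poltergeist, so why not?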
# test.py
import os
import socket
import urllib.parse
import urllib.request

tip_id = 3207221
socket.setdefaulttimeout(30)
url = "http://www.bettingexpert.com/archive/tip/" + str(tip_id)
headers = {'User-Agent': 'Mozilla/5.0'}
try:
    content = urllib.request.urlopen(urllib.request.Request(url, None, headers)).read().decode('utf-8')
except UnicodeEncodeError:
    # urllib failed on the non-ASCII redirect target: let the Ruby script
    # fetch the page and hand it back through a file named after the id.
    print("Overkill activated")
    os.system('ruby test.rb ' + str(tip_id))
    with open(str(tip_id), 'r') as file:
        content = file.read()
    os.remove(str(tip_id))
print(content)
# test.rb
require 'capybara'
require 'capybara/dsl'
require 'capybara/poltergeist'

Capybara.register_driver :poltergeist_no_timeout do |app|
  driver = Capybara::Poltergeist::Driver.new(app, timeout: 30)
  driver.browser.url_blacklist = %w(
    http://fonts.googleapis.com
    http://html5shiv.googlecode.com
  )
  driver
end
Capybara.default_driver = :poltergeist_no_timeout
Capybara.run_server = false
include Capybara::DSL

begin
  page.reset_session!
  page.visit("http://www.bettingexpert.com/archive/tip/#{ARGV[0]}")
rescue
  retry
end

File.open(ARGV[0], 'w') do |file|
  file.print(page.html)
end
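For the general problem, here is a sketch of a standard-library-only alternative to the Ruby fallback above. It assumes the failure comes from raw non-ASCII bytes in the `Location` header of the 302 response, and it percent-encodes the redirect target before urllib builds the follow-up request. The names `quote_location`, `QuotingRedirectHandler` and `fetch` are hypothetical. Newer Python 3 releases appear to apply similar quoting inside urllib itself, so this may only matter on older versions such as 3.4.

```python
import string
import urllib.parse
import urllib.request


def quote_location(url):
    """Percent-encode non-ASCII characters (and spaces) in a redirect target.

    http.client decodes header bytes as ISO-8859-1, so encoding with the same
    charset recovers the bytes the server originally sent. Punctuation,
    including '%', is left alone so existing escapes are not double-encoded.
    """
    return urllib.parse.quote(url, encoding="iso-8859-1", safe=string.punctuation)


class QuotingRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Redirect handler that quotes the new URL before following it."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return super().redirect_request(
            req, fp, code, msg, headers, quote_location(newurl))


def fetch(url, timeout=30):
    """Fetch a page, following redirects whose targets are not pure ASCII."""
    # Passing the subclass replaces the default HTTPRedirectHandler.
    opener = urllib.request.build_opener(QuotingRedirectHandler)
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with opener.open(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With this in place, `content = fetch("http://www.bettingexpert.com/archive/tip/3207221")` would stand in for the `urlopen` call in test.py. Whether the server accepts the percent-encoded path has not been verified here.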