Using proxybroker in Python 3.5 throws an encoding error
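I am trying to use proxybroker to generate a file containing active proxies for certain countries. I always hit the same error when trying to fetch proxies. The error appears to be an encoding/decoding error in a package that proxybroker uses. But I suspect there may be a better way to use proxybroker.
Here is the code that causes the problem: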
import asyncio

from proxybroker import Broker


def gather_proxies(countries):
    """
    This method uses the proxybroker package to asynchronously get two new proxies per specified country
    and returns the proxies as a list of country and proxy.
    :param countries: The ISO style country codes to fetch proxies for. Countries is a list of two letter strings.
    :return: A list of proxies that are themselves a list with two parameters [Location, proxy address].
    """
    proxy_list = []
    types = ['HTTP']
    for country in countries:
        loop = asyncio.get_event_loop()
        proxies = asyncio.Queue(loop=loop)
        broker = Broker(proxies, loop=loop)
        # find() documents countries as a list of ISO codes, so wrap the single code
        loop.run_until_complete(broker.find(limit=2, countries=[country], types=types))
        # The broker puts None on the queue once the search is finished
        while True:
            proxy = proxies.get_nowait()
            if proxy is None:
                break
            print(str(proxy))
            proxy_list.append([country, proxy.host + ":" + str(proxy.port)])
    return proxy_list
And the error message:
../app/main/download_thread.py:344: in update_proxies
proxy_list = gather_proxies(country_list)
../app/main/download_thread.py:368: in gather_proxies
loop.run_until_complete(broker.find(limit=2, countries=country, types=types))
/usr/lib/python3.5/asyncio/base_events.py:387: in run_until_complete
return future.result()
/usr/lib/python3.5/asyncio/futures.py:274: in result
raise self._exception
/usr/lib/python3.5/asyncio/tasks.py:241: in _step
result = coro.throw(exc)
../venv/lib/python3.5/site-packages/proxybroker/api.py:108: in find
await self._run(self._checker.check_judges(), action)
../venv/lib/python3.5/site-packages/proxybroker/api.py:114: in _run
await tasks
/usr/lib/python3.5/asyncio/futures.py:361: in __iter__
yield self # This tells Task to wait for completion.
/usr/lib/python3.5/asyncio/tasks.py:296: in _wakeup
future.result()
/usr/lib/python3.5/asyncio/futures.py:274: in result
raise self._exception
/usr/lib/python3.5/asyncio/tasks.py:241: in _step
result = coro.throw(exc)
../venv/lib/python3.5/site-packages/proxybroker/checker.py:26: in check_judges
await asyncio.gather(*[j.check() for j in self._judges])
/usr/lib/python3.5/asyncio/futures.py:361: in __iter__
yield self # This tells Task to wait for completion.
/usr/lib/python3.5/asyncio/tasks.py:296: in _wakeup
future.result()
/usr/lib/python3.5/asyncio/futures.py:274: in result
raise self._exception
/usr/lib/python3.5/asyncio/tasks.py:239: in _step
result = coro.send(None)
../venv/lib/python3.5/site-packages/proxybroker/judge.py:62: in check
page = await resp.text()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <ClientResponse(http://ip.spys.ru/) [200 OK]>
<CIMultiDictProxy('Date': 'Thu, 18 Aug 2016 11:02:53 GMT', 'Server': 'Ap...': 'no-cache', 'Vary': 'Accept-Encoding', 'Transfer-Encoding': 'chunked', 'Content-Type': 'text/html; charset=UTF-8')>
encoding = 'utf-8'
@asyncio.coroutine
def text(self, encoding=None):
"""Read response payload and decode."""
if self._content is None:
yield from self.read()
if encoding is None:
encoding = self._get_encoding()
> return self._content.decode(encoding)
E UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd6 in position 5568: invalid continuation byte
../venv/lib/python3.5/site-packages/aiohttp/client_reqrep.py:758: UnicodeDecodeError
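The problem seems to lie in the proxybroker or aiohttp package. But since those are supposed to be tested packages, the problem is probably in my code.
Can anyone see what I am doing wrong, or offer advice on how to use proxybroker?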
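The problem is in the resp.text() call. It retrieves the HTML page as text. aiohttp takes the encoding from the Content-Type header and falls back to the chardet library when none is declared, but for a malformed page neither gives a correct result: here the server declares charset=UTF-8 while the body contains the byte 0xd6 at a position where it is not valid UTF-8.
I think resp.text() should be replaced with resp.read(), so that the page is fetched as bytes without being decoded to str.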
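If a str is still needed after switching to resp.read(), the bytes can be decoded tolerantly instead of strictly. The following is a minimal sketch, not code from proxybroker or aiohttp: the helper name decode_page and the sample body are made up for illustration, and in practice the raw bytes would come from await resp.read().

```python
def decode_page(raw: bytes, declared_encoding: str = "utf-8") -> str:
    """Decode a response body without raising on malformed bytes.

    Hypothetical helper for illustration only.
    """
    try:
        # Strict decoding: this is the step that fails inside resp.text()
        return raw.decode(declared_encoding)
    except (UnicodeDecodeError, LookupError):
        # Malformed bytes or an unknown codec name: fall back to UTF-8 and
        # substitute U+FFFD for every byte that cannot be decoded
        return raw.decode("utf-8", errors="replace")


# 0xd6 announces a two-byte UTF-8 sequence, but "<" is not a continuation byte,
# which reproduces the "invalid continuation byte" failure from the traceback
body = b"<html>caf\xd6</html>"
print(decode_page(body))
```

The replacement character marks where the page was malformed, which is usually acceptable when the text is only scanned for an IP address or a few headers.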