How to use extract_links() to get URLs from a webpage encoded in 'gb2312'
Environment: Python 2.7, OS: Ubuntu
I want to extract some links from a webpage. I am testing it in the Scrapy shell, but I ran into a UnicodeDecodeError.
My code:
le = LinkExtractor()
le.extract_links(response)
The error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xcc in position 39: invalid continuation byte
In the page source I found that it is encoded in 'gb2312', so I tried:
print response.body.decode('gb2312')
which prints all of the HTML correctly.
But when I call:
le.extract_links(response.body.decode('gb2312'))
I get this error:
AttributeError: 'unicode' object has no attribute 'text'
This is because extract_links() needs an HTML response as its argument, but response.body and response.text return a bytes object and a unicode object respectively.
type(response)
returns: <class 'scrapy.http.response.html.HtmlResponse'>
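For reference, the related attribute types in Python 2 look like this (a quick sketch, not output copied from the original session):

type(response.body)   # <type 'str'>     -- raw bytes
type(response.text)   # <type 'unicode'> -- decoded text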
So I don't know how to fix the response so that I can extract links from it. Is there any way to make the returned response 'utf-8' instead of 'gb2312'?
Full traceback:
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/scrapy/linkextractors/lxmlhtml.py", line 128, in extract_links
links = self._extract_links(doc, response.url, response.encoding, base_url)
File "/usr/local/lib/python2.7/dist-packages/scrapy/linkextractors/__init__.py", line 109, in _extract_links
return self.link_extractor._extract_links(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/scrapy/linkextractors/lxmlhtml.py", line 76, in _extract_links
return self._deduplicate_if_needed(links)
File "/usr/local/lib/python2.7/dist-packages/scrapy/linkextractors/lxmlhtml.py", line 91, in _deduplicate_if_needed
return unique_list(links, key=self.link_key)
File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/python.py", line 78, in unique
seenkey = key(item)
File "/usr/local/lib/python2.7/dist-packages/scrapy/linkextractors/lxmlhtml.py", line 43, in <lambda>
keep_fragments=True)
File "/usr/local/lib/python2.7/dist-packages/w3lib/url.py", line 433, in canonicalize_url
parse_url(url), encoding=encoding)
File "/usr/local/lib/python2.7/dist-packages/w3lib/url.py", line 510, in parse_url
return urlparse(to_unicode(url, encoding))
File "/usr/local/lib/python2.7/dist-packages/w3lib/util.py", line 27, in to_unicode
return text.decode(encoding, errors)
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xcc in position 39: invalid continuation byte
I think you should be able to specify the encoding manually, like this:
response.replace(encoding='gb2312')
and then try passing that to the link extractor.
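In the Scrapy shell that attempt would look roughly like this (a minimal sketch, in the question's Python 2 syntax):

from scrapy.linkextractors import LinkExtractor

# replace() does not re-download anything; it builds a new response
# object with the declared encoding changed to gb2312
resp = response.replace(encoding='gb2312')
le = LinkExtractor()
for link in le.extract_links(resp):
    print link.url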
Edit: It seems Scrapy fails to apply the specified encoding somewhere in the link-processing chain (I believe in w3lib.url.canonicalize_url, while performing deduplication). As a workaround, you can use this:
resp = response.replace(encoding='utf8', body=response.text.encode('utf8'))
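The idea is to decode the body with the real encoding and hand the extractor a response whose bytes and declared encoding actually agree, so the URL decoding during deduplication no longer fails. Extraction then works as usual (sketch):

le = LinkExtractor()
links = le.extract_links(resp)   # should no longer raise UnicodeDecodeError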
w3lib.url.canonicalize_url does not work correctly on this webpage, and the workaround above,
resp = response.replace(encoding='utf8', body=response.text.encode('utf8'))
only worked for me in the Scrapy shell.
So in a spider we can specify canonicalize=True instead, like this:
LinkExtractor(canonicalize=True)
However, the Scrapy documentation says that, in general, when you're using LinkExtractor to follow links it is more robust to keep the default canonicalize=False.
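For completeness, a minimal spider sketch using that setting (the spider name and start URL are placeholders, not taken from the question):

import scrapy
from scrapy.linkextractors import LinkExtractor

class Gb2312LinksSpider(scrapy.Spider):
    name = 'gb2312_links'                  # hypothetical name
    start_urls = ['http://example.com/']   # placeholder URL

    def parse(self, response):
        # canonicalize=True canonicalizes URLs at extraction time, which
        # sidesteps the failing decode in the deduplication step
        le = LinkExtractor(canonicalize=True)
        for link in le.extract_links(response):
            yield scrapy.Request(link.url, callback=self.parse)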