Scrapy returns unicode - how to convert to a string?

When I make a request to a URL from the scrapy shell, I get a result like this:

In [6]: sel.xpath("//div[@class='my_class']").extract()
 [u'<div class="my_class"><ul><li class="parent">\n<a href="/category/tractors-ride-on-mowers/">\n\u0422\u0420\u0410\u041a\u0422\u041e\u0420\u042b \u0438 \u0420\u0410\u0419\u0414\u0415\u0420\u042b</a>\n<div class="sub1"><div class="str"></div><ul><li><a href="/category/lawn-tractors/" class="">\u0421\u0430\u0434\u043e\u0432\u044b\u0435 \u0442\u0440\u0430\u043a\u0442\u043e\u0440\u04....

How can I convert this to a readable string?

The string becomes readable once you print it (or write it to a file):

>>> u = u'<div class="my_class"><ul><li class="parent">\n<a href="/category/tractors-ride-on-mowers/">\n\u0422\u0420\u0410\u041a\u0422\u041e\u0420\u042b \u0438 \u0420\u0410\u0419\u0414\u0415\u0420\u042b</a>\n<div class="sub1"><div class="str"></div><ul><li><a href="/category/lawn-tractors/" class="">\u0421\u0430\u0434\u043e\u0432\u044b\u0435 \u0442\u0440\u0430\u043a\u0442\u043e\u0440'
>>> print (u)
<div class="my_class"><ul><li class="parent">
<a href="/category/tractors-ride-on-mowers/">
ТРАКТОРЫ и РАЙДЕРЫ</a>
<div class="sub1"><div class="str"></div><ul><li><a href="/category/lawn-tractors/" class="">Садовые трактор
>>> 
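To make the difference between the escaped display and the actual string content concrete, here is a minimal sketch. It assumes Python 3, where the built-in ascii() produces the same kind of \u-escaped view that the Python 2 interpreter shows for unicode strings. The shortened sample string and the temporary file name out.html are illustrative only and not part of the original question.

```python
import io
import os
import tempfile

# A shortened stand-in for one item of the list returned by .extract().
extracted = [u'<a href="/category/lawn-tractors/">\u0421\u0430\u0434\u043e\u0432\u044b\u0435</a>']
text = extracted[0]

# ascii() gives the escaped, ASCII-only view of the string (Python 3).
# The string itself already holds the real Cyrillic characters.
escaped_view = ascii(text)
print(escaped_view)

# Writing with an explicit encoding stores the readable characters.
# "out.html" in a temporary directory is a hypothetical target path.
path = os.path.join(tempfile.mkdtemp(), "out.html")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Reading the file back returns the same text, with no escapes involved.
with io.open(path, encoding="utf-8") as f:
    round_trip = f.read()
```

No conversion step is needed: the escapes only appear when the interpreter displays the string or the list that contains it. Calling print(text) on a UTF-8 terminal shows the Cyrillic text directly.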

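A few remarks:

  • sel.xpath("//div[@class='my_class']") selects the div elements.

  • sel.xpath("//div[@class='my_class']").extract() gives you the HTML string representation of the selected elements, as a list. If text nodes are within the selection, their non-ASCII characters show up as \u escape sequences holding the Unicode code points when the list is displayed; the strings themselves contain the actual characters.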

You can also ask for the string representation of the selected node directly, using XPath's string() function:

  • sel.xpath("string(//div[@class='my_class'])").extract()

  • or use the common string-joining pattern on the text() nodes: "".join(sel.xpath("//div[@class='my_class']//text()").extract())

Note that string() only considers the first element matched by the expression passed as its argument. From the XPath 1.0 specification:

A node-set is converted to a string by returning the string-value of the node in the node-set that is first in document order.
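The practical effect of that rule can be sketched with two matching div elements. This illustration uses the standard library's xml.etree.ElementTree rather than Scrapy. ElementTree has no string() function, so joining itertext() on the first match stands in for the XPath string-value here. The two-div sample document is invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with TWO divs that match the class filter.
doc = ET.fromstring(
    u'<body>'
    u'<div class="my_class"><a>\u0422\u0420\u0410\u041a\u0422\u041e\u0420\u042b</a></div>'
    u'<div class="my_class"><a>\u0421\u0430\u0434\u043e\u0432\u044b\u0435</a></div>'
    u'</body>'
)

# findall() returns the matching elements in document order.
matches = doc.findall(".//div[@class='my_class']")

# Emulates string(//div[@class='my_class']): only the first node in
# document order contributes, so the second div is ignored.
first_only = u"".join(matches[0].itertext())

# Emulates "".join(//div[@class='my_class']//text()): every text node
# under every matching div is included.
all_text = u"".join(t for div in matches for t in div.itertext())
```

With a single matching div, as in the question, both approaches return the same text. They diverge only when the expression matches more than one element.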


Example scrapy shell session:

$ scrapy shell
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f06700bc2d0>
[s]   item       {}
[s]   settings   <scrapy.settings.Settings object at 0x7f06700b6f10>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

In [1]: import scrapy

In [2]: sel = scrapy.Selector(text=u'''<div class="my_class"><ul><li class="parent">\n<a href="/category/tractors-ride-on-mowers/">\n\u0422\u0420\u0410\u041a\u0422\u041e\u0420\u042b \u0438 \u0420\u0410\u0419\u0414\u0415\u0420\u042b</a>\n<div class="sub1"><div class="str"></div><ul><li><a href="/category/lawn-tractors/" class="">\u0421\u0430\u0434\u043e\u0432\u044b\u0435 \u0442\u0440\u0430\u043a\u0442\u043e\u0440''')

In [3]: print "".join(sel.xpath('//div[@class="my_class"]//text()').extract())


ТРАКТОРЫ и РАЙДЕРЫ
Садовые трактор

In [4]: for r in sel.xpath('string(//div[@class="my_class"])').extract():
   ...:     print r
   ...:     


ТРАКТОРЫ и РАЙДЕРЫ
Садовые трактор

In [5]: