Scrapy returns unicode - how to convert to a string?
When I make a request to a URL from the scrapy shell, I get results like this:
In [6]: sel.xpath("//div[@class='my_class']").extract()
[u'<div class="my_class"><ul><li class="parent">\n<a href="/category/tractors-ride-on-mowers/">\n\u0422\u0420\u0410\u041a\u0422\u041e\u0420\u042b \u0438 \u0420\u0410\u0419\u0414\u0415\u0420\u042b</a>\n<div class="sub1"><div class="str"></div><ul><li><a href="/category/lawn-tractors/" class="">\u0421\u0430\u0434\u043e\u0432\u044b\u0435 \u0442\u0440\u0430\u043a\u0442\u043e\u0440\u04....
How can I convert this to a readable string?
It becomes readable once you print it (or write it to a file):
>>> u = u'<div class="my_class"><ul><li class="parent">\n<a href="/category/tractors-ride-on-mowers/">\n\u0422\u0420\u0410\u041a\u0422\u041e\u0420\u042b \u0438 \u0420\u0410\u0419\u0414\u0415\u0420\u042b</a>\n<div class="sub1"><div class="str"></div><ul><li><a href="/category/lawn-tractors/" class="">\u0421\u0430\u0434\u043e\u0432\u044b\u0435 \u0442\u0440\u0430\u043a\u0442\u043e\u0440'
>>> print (u)
<div class="my_class"><ul><li class="parent">
<a href="/category/tractors-ride-on-mowers/">
ТРАКТОРЫ и РАЙДЕРЫ</a>
<div class="sub1"><div class="str"></div><ul><li><a href="/category/lawn-tractors/" class="">Садовые трактор
>>>
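The point that the \u escapes are only notation can be checked directly. The following is a minimal sketch, assuming Python 3 (the transcripts in this thread come from a Python 2 shell); the variable names and the temporary file are illustrative and not part of the original question.

```python
import os
import tempfile

# First word of the link label from the question, written with \u escapes.
label = "\u0422\u0420\u0410\u041a\u0422\u041e\u0420\u042b"

# The escapes are only notation for the same eight Cyrillic letters.
assert label == "ТРАКТОРЫ"
assert len(label) == 8

# ascii() reproduces the escaped form that the Python 2 shell echoed.
escaped_view = ascii(label)

# Encoding turns the text into UTF-8 bytes (two bytes per Cyrillic letter).
utf8_bytes = label.encode("utf-8")

# Writing with an explicit encoding keeps the file readable in a UTF-8 editor.
path = os.path.join(tempfile.mkdtemp(), "label.txt")  # hypothetical output file
with open(path, "w", encoding="utf-8") as handle:
    handle.write(label)

# Reading the file back returns the original text unchanged.
with open(path, encoding="utf-8") as handle:
    round_trip = handle.read()

print(escaped_view)
```

On Python 3 the shell echo of a list of strings already shows the Cyrillic letters, so the escaped form mostly appears on Python 2 or when `ascii()` is called explicitly.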
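A few remarks:
sel.xpath("//div[@class='my_class']") selects the div elements.
sel.xpath("//div[@class='my_class']").extract() gives you the HTML string representation of the selected elements, as a list. If text nodes inside the selection contain non-ASCII content, the echoed list shows it as \u escape sequences holding the Unicode code points.
You can also ask for the string value of the selected node directly with XPath's string() function:
sel.xpath("string(//div[@class='my_class'])").extract()
or use the common pattern of joining text() nodes:
"".join(sel.xpath("//div[@class='my_class']//text()").extract())
Note that string() only considers the first element matched by the expression passed as its argument. From the XPath 1.0 specification:
A node-set is converted to a string by returning the string-value of the node in the node-set that is first in document order.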
Sample scrapy shell session:
$ scrapy shell
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x7f06700bc2d0>
[s] item {}
[s] settings <scrapy.settings.Settings object at 0x7f06700b6f10>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
In [1]: import scrapy
In [2]: sel = scrapy.Selector(text=u'''<div class="my_class"><ul><li class="parent">\n<a href="/category/tractors-ride-on-mowers/">\n\u0422\u0420\u0410\u041a\u0422\u041e\u0420\u042b \u0438 \u0420\u0410\u0419\u0414\u0415\u0420\u042b</a>\n<div class="sub1"><div class="str"></div><ul><li><a href="/category/lawn-tractors/" class="">\u0421\u0430\u0434\u043e\u0432\u044b\u0435 \u0442\u0440\u0430\u043a\u0442\u043e\u0440''')
In [3]: print("".join(sel.xpath('//div[@class="my_class"]//text()').extract()))
ТРАКТОРЫ и РАЙДЕРЫ
Садовые трактор
In [4]: for r in sel.xpath('string(//div[@class="my_class"])').extract():
   ...:     print(r)
   ...:
ТРАКТОРЫ и РАЙДЕРЫ
Садовые трактор
In [5]: