How to get the line number of a match with scrapy
Using the following example:
$ scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html
where selectors-sample1.html is:
<html>
<head>
<base href='http://example.com/' />
<title>Example website</title>
</head>
<body>
<div id='images'>
<a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
<a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
<a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
<a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
<a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
</div>
</body>
</html>
Is it possible to get the line number of a match with Scrapy 1.1.2? For example, something like:
$ response.selector.xpath('//title/text()').some_magic_to_get_line_number
$ # should output 4
Thanks!
I don't know how to get the source line for text nodes, but for element nodes you can hack into the selector's underlying lxml object (via .root) and access its .sourceline attribute:
$ scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html
2016-09-08 18:13:12 [scrapy] INFO: Scrapy 1.1.2 started (bot: scrapybot)
2016-09-08 18:13:12 [scrapy] INFO: Spider opened
2016-09-08 18:13:13 [scrapy] DEBUG: Crawled (200) <GET http://doc.scrapy.org/en/latest/_static/selectors-sample1.html> (referer: None)
>>> response.selector.xpath('//title/text()')
[<Selector xpath='//title/text()' data=u'Example website'>]
>>> s = response.selector.xpath('//title/text()')[0]
>>> type(s)
<class 'scrapy.selector.unified.Selector'>
>>> type(s.root)
<type 'str'>
>>> s = response.selector.xpath('//title')[0]
>>> s.root
<Element title at 0x7fa95d3f1908>
>>> type(s.root)
<type 'lxml.etree._Element'>
>>> dir(s.root)
['__class__', '__contains__', '__copy__', '__deepcopy__', '__delattr__', '__delitem__', '__doc__', '__format__', '__getattribute__', '__getitem__', '__hash__', '__init__', '__iter__', '__len__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', '_init', 'addnext', 'addprevious', 'append', 'attrib', 'base', 'clear', 'cssselect', 'extend', 'find', 'findall', 'findtext', 'get', 'getchildren', 'getiterator', 'getnext', 'getparent', 'getprevious', 'getroottree', 'index', 'insert', 'items', 'iter', 'iterancestors', 'iterchildren', 'iterdescendants', 'iterfind', 'itersiblings', 'itertext', 'keys', 'makeelement', 'nsmap', 'prefix', 'remove', 'replace', 'set', 'sourceline', 'tag', 'tail', 'text', 'values', 'xpath']
>>> s.root.sourceline
4
>>>
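If you want to reuse this outside the shell, here is a minimal helper sketch (the sourceline_of name is my own, not part of Scrapy's API). It assumes you select the element itself rather than its text node, since a text-node selector's .root is a plain string and has no sourceline:

from scrapy.selector import Selector

def sourceline_of(sel):
    # Return the 1-based source line of an element selector,
    # or None when .root is not an lxml element (e.g. a text node).
    return getattr(sel.root, 'sourceline', None)

html = '''<html>
<head>
<base href='http://example.com/' />
<title>Example website</title>
</head>
<body></body>
</html>'''

selector = Selector(text=html)
print(sourceline_of(selector.xpath('//title')[0]))         # 4
print(sourceline_of(selector.xpath('//title/text()')[0]))  # None (text node)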