Using XPath in Scrapy to select any text below a paragraph
Hmm, my initial code works, but it misses some odd formatting on the site:
response.xpath("//*[contains(., 'Description:')]/following-sibling::p/text()").extract()
<div id="body">
<a name="main_content" id="main_content"></a>
<!-- InstanceBeginEditable name="main_content" -->
<div class="return_to_div"><a href="../../index.html">HOME</a> | <a href="../index.html">DEATH ROW</a> | <a href="index.html">INFORMATION</a> | text</div>
<h1>text</h1>
<h2>text</h2>
<p class="text_bold">text:</p>
<p>text</p>
<p class="text_bold">text:</p>
<p>text</p>
<p class="text_bold">Description:</p>
<p>Line1</p>
<p>Line2</p>
Line3 <!-- InstanceEndEditable -->
</div>
I pull Line1 and Line2 just fine. But Line3 is not a sibling of my p class. This only happens on some of the pages I'm trying to scrape from the table.
Here is the link: https://www.tdcj.state.tx.us/death_row/dr_info/wardadamlast.html
Sorry, XPath confuses me. Is there a way to extract all the data that matches //*[contains(., 'Description:')] without it having to be a sibling?
Thanks in advance.
Edited: changed the example to better reflect the actual situation, and added a link to the original page.
You can select all the sibling nodes (elements and text nodes) that come after the <p> containing "Description:" (following-sibling::node()), and then get all the text nodes from them (descendant-or-self::text()):
>>> import scrapy
>>> response = scrapy.Selector(text="""<div>
... <p> Name </p>
... <p> Age </p>
... <p class="text-bold"> Description: </p>
... <p> Line 1 </p>
... <p> Line 2 </p>
... Line 3
... </div>""", type="html")
>>> response.xpath("""//div/p[contains(., 'Description:')]
... /following-sibling::node()
... /descendant-or-self::text()""").extract()
[u'\n ', u' Line 1 ', u'\n ', u' Line 2 ', u'\nLine 3\n']
>>>
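As a side note, the extracted list still contains the whitespace-only text nodes sitting between the <p> elements. A minimal post-processing sketch, assuming you only want the clean lines:

```python
# Mirror of the .extract() output shown above.
parts = ['\n    ', ' Line 1 ', '\n    ', ' Line 2 ', '\nLine 3\n']

# Drop whitespace-only entries and trim the rest.
lines = [p.strip() for p in parts if p.strip()]
print(lines)  # -> ['Line 1', 'Line 2', 'Line 3']
```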
Let's break this down.
So, you already know how to find the right <p> containing "Description:" (using the XPath //div/p[contains(., 'Description:')]):
>>> response.xpath("//div/p[contains(., 'Description:')]").extract()
[u'<p class="text-bold"> Description: </p>']
You want the <p>s that come after it (the following-sibling:: axis + the p element selection):
>>> response.xpath("//div/p[contains(., 'Description:')]/following-sibling::p").extract()
[u'<p> Line 1 </p>', u'<p> Line 2 </p>']
That doesn't give you Line 3. So you read up on XPath and tried the "catch-all" *:
>>> response.xpath("//div/p[contains(., 'Description:')]/following-sibling::*").extract()
[u'<p> Line 1 </p>', u'<p> Line 2 </p>']
Still no luck. Why? Because * selects only elements (commonly called "tags" for simplicity's sake). The Line 3 you're after is a text node, a child of the parent <div> element. But a text node is also a node (!), so you can select it as a sibling of that famous <p> above:
>>> response.xpath("//div/p[contains(., 'Description:')]/following-sibling::node()").extract()
[u'\n ', u'<p> Line 1 </p>', u'\n ', u'<p> Line 2 </p>', u'\nLine 3\n']
OK, so now it looks like we have the nodes we want ("tag" elements and text nodes). But you'd still get those "<p>" in the output of .extract() (XPath selected the elements, not their "inner" text). So you read more about XPath and used the .//text() step (roughly "all child text nodes from here"):
>>> response.xpath("//div/p[contains(., 'Description:')]/following-sibling::node()//text()").extract()
[u' Line 1 ', u' Line 2 ']
Uh, wait, where did Line 3 go?
In fact, // is a shortcut for /descendant-or-self::node()/, so ./descendant-or-self::node()/text() will select only the child text nodes of those following <p>s (text nodes have no children, and self::text()/text() will never match any text node):
>>> response.xpath("//div/p[contains(., 'Description:')]/following-sibling::node()/descendant-or-self::node()/text()").extract()
[u' Line 1 ', u' Line 2 ']
What you can do here is use the handy descendant-or-self axis + the text() node test, so that if following-sibling::node() reaches a text node, the "self" in descendant-or-self will match the text node, and the text() node test holds true:
>>> response.xpath("//div/p[contains(., 'Description:')]/following-sibling::node()/descendant-or-self::text()").extract()
[u'\n ', u' Line 1 ', u'\n ', u' Line 2 ', u'\nLine 3\n']
Using the example URL from the OP's edited question:
$ scrapy shell https://www.tdcj.state.tx.us/death_row/dr_info/wardadamlast.html
2016-05-19 13:14:44 [scrapy] INFO: Scrapy 1.1.0 started (bot: scrapybot)
2016-05-19 13:14:44 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter'}
(...)
2016-05-19 13:14:48 [scrapy] INFO: Spider opened
2016-05-19 13:14:50 [scrapy] DEBUG: Crawled (200) <GET https://www.tdcj.state.tx.us/death_row/dr_info/wardadamlast.html> (referer: None)
>>> t = response.xpath("""
... //div/p[contains(., 'Last Statement:')]
... /following-sibling::node()
... /descendant-or-self::text()""").extract()
>>>
>>>
>>> print(''.join(t))
I would like to thank everyone that has showed up on my behalf, Kathryn Cox, I love you dearly. Thank you Randy Cannon for showing up and being a lifelong friend. Thank you Dr. Steve Ball for trying to bring the right out. There are a lot of injustices that are happening with this. This is wrong. Thank you Reverend Leon Harrison for showing me the grace of God. Thank you for all of my friends that are out there. This is not a capital case. I never had intended to do anything. I feel very grieved for the loss of Walker, and for Donovan and Marissa Walker. I hope they can find peace and be productive in society. I would like to thank all of my friends on the row even though everything didn’t work, close isn’t good enough. I hope that positive change will come out of this.
I would like to thank my father and mother for everything that they showed me. I would like to apologize for putting them through this. I would like to ask for the truth to come out and make positive changes. Above all else Donovan and Marissa can find love and peace. I hope they overcome the loss of their father. At no time did I intend to hurt him.
When the truth comes out I hope that they can find closure. There are a lot of things that are not right in this world, I have had to overcome them myself. I hope all that are on the row, I hope they find peace and solace in their life. Everyone can find peace in a Christian God or whatever God they believe in. I thank you mom and dad for everything, I love you dearly. One last thing, I thank all of my friends that showed loyalty and graced my life with more positive. I would also like to thank Gustav’s mother for having such a great son, and showing me much love. I have met good people on the row, not all of them are bad. I hope everyone can see that. I just want to thank everybody that came to witness this. I thank everyone, I am sorry things didn’t work out. May God forgive us all? I am sorry mother and I am sorry father. I hope you find peace and solace in your heart. I know there is something else I need to say. I feel that.
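For what it's worth, none of this is Scrapy-specific: Scrapy selectors are built on lxml, and the same expression works with plain lxml as well (a minimal sketch, assuming lxml is installed):

```python
import lxml.html

html = """<div>
<p> Name </p>
<p> Age </p>
<p class="text-bold"> Description: </p>
<p> Line 1 </p>
<p> Line 2 </p>
Line 3
</div>"""

root = lxml.html.fromstring(html)
# Same XPath as above: sibling nodes after the "Description:" <p>,
# then every text node reachable from them (including the nodes themselves).
texts = root.xpath(
    "//div/p[contains(., 'Description:')]"
    "/following-sibling::node()"
    "/descendant-or-self::text()"
)
print([t.strip() for t in texts if t.strip()])  # -> ['Line 1', 'Line 2', 'Line 3']
```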