Using XPath in Scrapy to select any text below a paragraph
Hmm, my initial code works, but it misses some odd formatting on the site:
response.xpath("//*[contains(., 'Description:')]/following-sibling::p/text()").extract()
<div id="body">
<a name="main_content" id="main_content"></a>
<!-- InstanceBeginEditable name="main_content" -->
<div class="return_to_div"><a href="../../index.html">HOME</a> | <a href="../index.html">DEATH ROW</a> | <a href="index.html">INFORMATION</a> | text</div>
<h1>text</h1>
<h2>text</h2>
<p class="text_bold">text:</p>
<p>text</p>
<p class="text_bold">text:</p>
<p>text</p>
<p class="text_bold">Description:</p>
<p>Line1</p>
<p>Line2</p>
Line3 <!-- InstanceEndEditable -->
</div>
I pull Line1 and Line2 just fine. But Line3 is not a sibling of my p class. This only happens on some of the pages I'm trying to scrape from the table.
Here is the link: https://www.tdcj.state.tx.us/death_row/dr_info/wardadamlast.html
Sorry, XPath confuses me. Is there a way to extract all the data that matches //*[contains(., 'Description:')] without it having to be a sibling?
Thanks in advance.
Edited: changed the example to better reflect the actual situation, and added a link to the original page.
You can select all the sibling nodes (elements and text nodes) that come after the <p> containing "Description:" (following-sibling::node()), and then get all the text nodes from them (descendant-or-self::text()):
>>> import scrapy
>>> response = scrapy.Selector(text="""<div>
... <p> Name </p>
... <p> Age </p>
... <p class="text-bold"> Description: </p>
... <p> Line 1 </p>
... <p> Line 2 </p>
... Line 3
... </div>""", type="html")
>>> response.xpath("""//div/p[contains(., 'Description:')]
... /following-sibling::node()
... /descendant-or-self::text()""").extract()
[u'\n ', u' Line 1 ', u'\n ', u' Line 2 ', u'\nLine 3\n']
>>>
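As a side note, the extracted list still contains the whitespace-only text nodes sitting between the <p> elements. A minimal post-processing sketch, assuming you only want the clean lines:

```python
# Mirror of the .extract() output shown above.
parts = ['\n    ', ' Line 1 ', '\n    ', ' Line 2 ', '\nLine 3\n']

# Drop whitespace-only entries and trim the rest.
lines = [p.strip() for p in parts if p.strip()]
print(lines)  # -> ['Line 1', 'Line 2', 'Line 3']
```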
Let's break this down.
So, you already know how to find the right <p> containing "Description:" (using the XPath //div/p[contains(., 'Description:')]):
>>> response.xpath("//div/p[contains(., 'Description:')]").extract()
[u'<p class="text-bold"> Description: </p>']
You want the <p>s that come after it (the following-sibling:: axis + the p element selection):
>>> response.xpath("//div/p[contains(., 'Description:')]/following-sibling::p").extract()
[u'<p> Line 1 </p>', u'<p> Line 2 </p>']
That doesn't give you Line 3. So you read up on XPath and tried the "catch-all" *:
>>> response.xpath("//div/p[contains(., 'Description:')]/following-sibling::*").extract()
[u'<p> Line 1 </p>', u'<p> Line 2 </p>']
Still no luck. Why? Because * selects only elements (commonly called "tags" for simplicity's sake). The Line 3 you're after is a text node, a child of the parent <div> element. But a text node is also a node (!), so you can select it as a sibling of that famous <p> above:
>>> response.xpath("//div/p[contains(., 'Description:')]/following-sibling::node()").extract()
[u'\n ', u'<p> Line 1 </p>', u'\n ', u'<p> Line 2 </p>', u'\nLine 3\n']
OK, so now it looks like we have the nodes we want ("tag" elements and text nodes). But you'd still get those "<p>" in the output of .extract() (XPath selected the elements, not their "inner" text). So you read more about XPath and used the .//text() step (roughly "all child text nodes from here"):
>>> response.xpath("//div/p[contains(., 'Description:')]/following-sibling::node()//text()").extract()
[u' Line 1 ', u' Line 2 ']
Uh, wait, where did Line 3 go?
In fact, // is a shortcut for /descendant-or-self::node()/, so ./descendant-or-self::node()/text() will select only the child text nodes of those following <p>s (text nodes have no children, and self::text()/text() will never match any text node):
>>> response.xpath("//div/p[contains(., 'Description:')]/following-sibling::node()/descendant-or-self::node()/text()").extract()
[u' Line 1 ', u' Line 2 ']
What you can do here is use the handy descendant-or-self axis + the text() node test, so that if following-sibling::node() reaches a text node, the "self" in descendant-or-self will match the text node, and the text() node test holds true:
>>> response.xpath("//div/p[contains(., 'Description:')]/following-sibling::node()/descendant-or-self::text()").extract()
[u'\n ', u' Line 1 ', u'\n ', u' Line 2 ', u'\nLine 3\n']
Using the example URL from the OP's edited question:
$ scrapy shell https://www.tdcj.state.tx.us/death_row/dr_info/wardadamlast.html
2016-05-19 13:14:44 [scrapy] INFO: Scrapy 1.1.0 started (bot: scrapybot)
2016-05-19 13:14:44 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter'}
(...)
2016-05-19 13:14:48 [scrapy] INFO: Spider opened
2016-05-19 13:14:50 [scrapy] DEBUG: Crawled (200) <GET https://www.tdcj.state.tx.us/death_row/dr_info/wardadamlast.html> (referer: None)
>>> t = response.xpath("""
... //div/p[contains(., 'Last Statement:')]
... /following-sibling::node()
... /descendant-or-self::text()""").extract()
>>>
>>>
>>> print(''.join(t))
I would like to thank everyone that has showed up on my behalf, Kathryn Cox, I love you dearly. Thank you Randy Cannon for showing up and being a lifelong friend. Thank you Dr. Steve Ball for trying to bring the right out. There are a lot of injustices that are happening with this. This is wrong. Thank you Reverend Leon Harrison for showing me the grace of God. Thank you for all of my friends that are out there. This is not a capital case. I never had intended to do anything. I feel very grieved for the loss of Walker, and for Donovan and Marissa Walker. I hope they can find peace and be productive in society. I would like to thank all of my friends on the row even though everything didn’t work, close isn’t good enough. I hope that positive change will come out of this.
I would like to thank my father and mother for everything that they showed me. I would like to apologize for putting them through this. I would like to ask for the truth to come out and make positive changes. Above all else Donovan and Marissa can find love and peace. I hope they overcome the loss of their father. At no time did I intend to hurt him.
When the truth comes out I hope that they can find closure. There are a lot of things that are not right in this world, I have had to overcome them myself. I hope all that are on the row, I hope they find peace and solace in their life. Everyone can find peace in a Christian God or whatever God they believe in. I thank you mom and dad for everything, I love you dearly. One last thing, I thank all of my friends that showed loyalty and graced my life with more positive. I would also like to thank Gustav’s mother for having such a great son, and showing me much love. I have met good people on the row, not all of them are bad. I hope everyone can see that. I just want to thank everybody that came to witness this. I thank everyone, I am sorry things didn’t work out. May God forgive us all? I am sorry mother and I am sorry father. I hope you find peace and solace in your heart. I know there is something else I need to say. I feel that.
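For what it's worth, none of this is Scrapy-specific: Scrapy selectors are built on lxml, and the same expression works with plain lxml as well (a minimal sketch, assuming lxml is installed):

```python
import lxml.html

html = """<div>
<p> Name </p>
<p> Age </p>
<p class="text-bold"> Description: </p>
<p> Line 1 </p>
<p> Line 2 </p>
Line 3
</div>"""

root = lxml.html.fromstring(html)
# Same XPath as above: sibling nodes after the "Description:" <p>,
# then every text node reachable from them (including the nodes themselves).
texts = root.xpath(
    "//div/p[contains(., 'Description:')]"
    "/following-sibling::node()"
    "/descendant-or-self::text()"
)
print([t.strip() for t in texts if t.strip()])  # -> ['Line 1', 'Line 2', 'Line 3']
```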