How to get innerHTML of a node using scrapy Selector?

Question

Suppose there is an HTML fragment like this:

<a>
   text in a
   <b>text in b</b>
   <c>text in c</c>
</a>
<a>
   <b>text in b</b>
   text in a
   <c>text in c</c>
</a>

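I want to extract the text inside each <a> tag, dropping the nested tags but keeping their text. For example, from the fragment above I would like to get something like "text in a text in b text in c" and "text in b text in a text in c". I can already get the nodes with the scrapy Selector css() function, so how should I process those nodes to get what I want? Any ideas would be appreciated, thank you!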

Answer 1

This is how I managed to do it:

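from scrapy.selector import Selector

# html_string is assumed to hold the HTML from the question
sel = Selector(text=html_string)

# 'a *::text' selects every text node inside each <a> element
for node in sel.css('a *::text'):
    print(node.extract())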

Assuming html_string is a variable containing the HTML from the question, this code produces the following output:

   text in a

text in b


text in c




text in b

   text in a

text in c

The selector a *::text matches all text nodes that are descendants of the a nodes.
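
The output above still carries the original line breaks and indentation, while the question asks for a single line per element. A minimal sketch of one way to tidy this up with plain Python, assuming the text nodes of one <a> element have already been collected into a list (the list text_nodes and the helper collapse_whitespace are illustrative names, not part of scrapy):

```python
# Text nodes of the first <a> element, written out by hand to mirror the
# output shown above (an assumption made for illustration).
text_nodes = ["\n   text in a\n   ", "text in b", "\n   ", "text in c", "\n"]


def collapse_whitespace(parts):
    """Concatenate text fragments and squeeze each whitespace run to one space."""
    # str.split() with no arguments splits on any whitespace run and drops
    # leading/trailing whitespace, so joining with " " normalizes the text.
    return " ".join("".join(parts).split())


print(collapse_whitespace(text_nodes))  # text in a text in b text in c
```

Note that the loop in this answer yields the text nodes of all <a> elements in one flat sequence, so grouping them per element would need a separate css('a') loop first.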

Answer 2

You can use XPath's string() function on the elements you select:

$ python
>>> import scrapy
>>> selector = scrapy.Selector(text="""<a>
...    text in a
...    <b>text in b</b>
...    <c>text in c</c>
... </a>
... <a>
...    <b>text in b</b>
...    text in a
...    <c>text in c</c>
... </a>""", type="html")
>>> for link in selector.css('a'):
...     print(link.xpath('string(.)').extract())
... 
['\n   text in a\n   text in b\n   text in c\n']
['\n   text in b\n   text in a\n   text in c\n']
>>>

Answer 3

In Scrapy 1.5, you can use /* to get the innerHTML. Example:

content = response.xpath('//div[@class="viewbox"]/div[@class="content"]/*').extract_first()

Answer 4

Try this:

response.xpath('//a/node()').extract()

Tags: html, python, xpath, css-selectors, scrapy