How do I get all content between two html tags in Python?

Question

I am trying to extract all content (tags and text) from one main tag on an html page. For example:

`my_html_page = '''
<html>
    <body>
       <div class="post_body">
          <span class="polor">
             <a class="p-color">Some text</a>
             <a class="p-color">another text</a>
          </span>
          <a class="p-color">hello world</a>
          <p id="bold">
              some text inside p
             <ul>
                <li class="list">one li</li>
                <li>second li</li>
             </ul>
         </p>
         some text 2
         <div>
             text inside div
         </div>
         some text 3
      </div>
      <div class="post_body">
          <a>text inside second main div</a>
      </div>
      <div class="post_body">
          <span>third div</span>
      </div>
      <div class="post_body">
          <p>four div</p>
      </div>
      <div class="post">
          other text
      </div>
  </body>
<html>'''`

Using xpath("(//div[@class="post_body"])[1]"), I need to get:

`
       <div class="post_body">
          <span class="polor">
             <a class="p-color">Some text</a>
             <a class="p-color">another text</a>
          </span>
          <a class="p-color">hello world</a>
          <p id="bold">
              some text inside p
             <ul>
                <li class="list">one li</li>
                <li>second li</li>
             </ul>
         </p>
         some text 2
         <div>
             text inside div
         </div>
         some text 3
      </div>
`

That is, everything inside the <div class="post_body"> tag.

I read this topic, but it didn't help.

I need to build the DOM in lxml through the BeautifulSoup parser.

 import lxml.html.soupparser
 import lxml.html
 text_inside_tag = lxml.html.soupparser.fromstring(my_html_page)
 text = text_inside_tag.xpath('(//div[@class="post_body"])[1]/text()')

This only extracts the text inside the tag, but I need the text together with the tags.

If I try this:

for elem in text.xpath("(//div[@class="post_body"])[1]/text()"):
   print lxml.html.tostring(elem, pretty_print=True)

I get an error: TypeError: Type '_ElementStringResult' cannot be serialized.

Please help.

Answer 1

You can try it this way:

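import lxml.html.soupparser
import lxml.html

my_html_page = '''...some html markup here...'''
root = lxml.html.soupparser.fromstring(my_html_page)

for elem in root.xpath("//div[@class='post_body']"):
    # elem.text is None when the div starts directly with a child element.
    # encoding='unicode' makes tostring() return str rather than bytes,
    # so the pieces can be joined on Python 3.
    result = (elem.text or '') + ''.join(
        lxml.html.tostring(e, pretty_print=True, encoding='unicode') for e in elem)
    print(result)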

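The result variable is built by combining the text node at the start of the parent <div> with the markup of all of its child nodes. By default lxml.html.tostring also includes the text that follows each child (its tail), so text sitting between child tags is kept as well.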


python

xml

xpath

lxml

beautifulsoup