xpath

Question

I need to extract the inner HTML of one tag on a page, keeping both the nested tags and their text. For example:

<html>
 <body>
  <div class="post">
   text <p> text </p> text <a> text </a>
   <span> text </span>
  <div class="post">
   another text <p> text </p>
 </body>
</html>

I need the HTML inside the first <div class="post">:

text <p> text </p> text <a> text </a>
   <span> text </span>

with the tags included.

With XPath I can only extract the text: "(//div[@class="post"])[1]/descendant-or-self::*[not(name()="script")]/text()" Result = text text text text text

I tried "(//div[@class="post_body"])[1]/node()", but I don't know how to build a string from the result.

P.S. A suggestion for another approach (for example, BeautifulSoup) would also be welcome. Please help.

Answer 1

Use the find() method to get the first div, then join its children.

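from bs4 import BeautifulSoup

# Name the parser explicitly so bs4 does not warn about guessing one.
soup = BeautifulSoup("""<html>
     <body>
      <div class="post">
       text <p> text </p> text <a> text </a>
       <span> text </span></div>
      <div class="post">
       another text <p> text </p></div>
     </body>
    </html>""", "html.parser")

# find() returns the first matching tag; iterating over it yields its direct children.
first_div = soup.find('div', attrs={'class': 'post'})
# Bare strings are stripped, child tags are serialized back to HTML with str().
first_div_text = [child.strip() if isinstance(child, str) else str(child) for child in first_div]
print(''.join(first_div_text))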

Output

text<p> text </p>text<a> text </a><span> text </span>
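If the whitespace between the children should be preserved rather than stripped, BeautifulSoup's Tag.decode_contents() returns the inner markup of a tag as a single string. A minimal sketch under the same assumptions as above (the sample markup with both divs closed, and the standard-library "html.parser" backend):

```python
from bs4 import BeautifulSoup

PAGE = """<html>
 <body>
  <div class="post">
   text <p> text </p> text <a> text </a>
   <span> text </span></div>
  <div class="post">
   another text <p> text </p></div>
 </body>
</html>"""

soup = BeautifulSoup(PAGE, "html.parser")

# find() returns the first <div class="post">; class_ avoids the reserved word.
first_div = soup.find("div", class_="post")

# decode_contents() serializes only the children, not the <div> tag itself.
inner = first_div.decode_contents()
print(inner.strip())
```

Unlike the list comprehension, this keeps the original spacing and line breaks between the text and the tags; only the outer whitespace is trimmed by the final strip().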

xpath - how to extract HTML from one tag?

html

lxml

beautifulsoup