Getting a specific text from html using BeautifulSoup

Question

I have this .html code:

<div id="content">
            <ul id="tree">
                <li xmlns="" class="level top failed open">
                    <span><em class="time">
                            <div class="time">1.89 s</div>
                        </em>I need to get this text</span>

I only need to get the text that sits outside all the other tags (the text is: I need to get this text).

I am trying to use this code:

path = document.find('li', class_='level top').find_all("em")[-1].next_sibling
if not path:
    path = document.find('li', class_='level top failed open').find_all("em")[-1].next_sibling
return path

But I get an error: AttributeError: 'NoneType' object has no attribute 'find_all'.

Does anyone know how to access this text?

Thanks!

Answer 1

Try using this:

.find_all("span", text=True)

because the text is inside the span element.
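
One caveat, offered as a hedged sketch rather than a definitive fix: on a tag search, `text=True` (spelled `string=True` in newer BeautifulSoup versions) matches tags whose `.string` is set, and the `<span>` here holds both an `<em>` child and a text node, so it may not match at all. Assuming markup like the one in the question, one alternative is to ask the span for its direct string children with `recursive=False`:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (assumed structure).
html = '''
<ul id="tree">
  <li class="level top failed open">
    <span><em class="time"><div class="time">1.89 s</div></em>I need to get this text</span>
  </li>
</ul>
'''

soup = BeautifulSoup(html, "html.parser")
span = soup.find("span")

# string=True with recursive=False returns only the text nodes that are
# direct children of <span>, skipping the text nested inside <em>.
direct_text = [
    s.strip()
    for s in span.find_all(string=True, recursive=False)
    if s.strip()
]
print(direct_text)
```

The list comprehension drops whitespace-only nodes, which show up when the HTML is indented.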

Answer 2

You can use .contents, which returns a list of the tag's children; the item you want is the last one, [-1].

html = '''
<div id="content">
 <ul id="tree">
  <li class="level top failed open" xmlns="">
   <span>
    <em class="time">
     <div class="time">
      1.89 s
     </div>
    </em>
    I need to get this text
   </span>
  </li>
 </ul>
</div>

'''

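from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
# print(soup.prettify())

# .contents lists the direct children of the <span>;
# the last one is the text node that follows </em>
txt = soup.select_one('#tree > li > span').contents[-1]
print(txt)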

Output:

  I need to get this text
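
As for the AttributeError in the question: when `class_` is given a string containing spaces, BeautifulSoup compares it against the exact value of the class attribute, so `class_='level top'` does not match `class="level top failed open"`. `find` then returns None, and the chained `.find_all` raises before the `if not path` fallback is ever reached. Below is a minimal sketch, assuming the same markup, that matches on the two classes with a CSS selector and checks for None at each step. The helper name `get_trailing_text` is made up for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (assumed structure).
html = '''
<ul id="tree">
  <li class="level top failed open">
    <span><em class="time"><div class="time">1.89 s</div></em>I need to get this text</span>
  </li>
</ul>
'''


def get_trailing_text(document):
    """Return the text right after the last <em> in the first matching <li>, or None."""
    # li.level.top matches any <li> carrying both classes, whatever else is set
    li = document.select_one("li.level.top")
    if li is None:
        return None
    ems = li.find_all("em")
    if not ems:
        return None
    sibling = ems[-1].next_sibling
    # next_sibling may be None or a Tag; keep only text nodes
    return sibling.strip() if isinstance(sibling, str) else None


soup = BeautifulSoup(html, "html.parser")
print(get_trailing_text(soup))
```

Returning None instead of chaining calls keeps a missing element from turning into an exception further down the line.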
