Finding text from html using BeautifulSoup

I have the following HTML:

<li class="print text">
                            <span><em class="time">
                                    <div class="time">1.29 s</div>
                                </em><em class="status">passed</em>This is the text I want to get</span>

I only need the text that sits outside all the other tags (the text is: This is the text I want to get).

I am trying to use this code:

for el in doc.find_all('li', attrs={'class': 'print text'}):
    print(el.get_text())

But unfortunately it prints everything, including the contents of the em tags.

Is there any way to do this?

Thanks!!

Find the specific li tag by its class, call find_all on it to collect the em tags, take the last one from the list by index, and use next_sibling to get the text that follows it.

from bs4 import BeautifulSoup
soup="""<li class="print text">
        <span><em class="time">
                <div class="time">1.29 s</div>
            </em><em class="status">passed</em>This is the text I want to get</span>"""

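# Name the parser explicitly so the result does not depend on which parsers are installed
soup=BeautifulSoup(soup, "html.parser")
# The wanted text is the node right after the last <em> inside the <li>
print(soup.find("li",class_="print text").find_all("em")[-1].next_sibling)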
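The one-liner above raises an error if the li or the em tags are missing, and next_sibling can also be None or another tag. A minimal sketch of a more defensive version follows; the helper name text_after_last_em is hypothetical, and a closing </li> is added to the sample for completeness.

```python
from bs4 import BeautifulSoup, NavigableString

html = """<li class="print text">
        <span><em class="time">
                <div class="time">1.29 s</div>
            </em><em class="status">passed</em>This is the text I want to get</span></li>"""


def text_after_last_em(li):
    """Return the text node that follows the last <em> inside li, or None.

    Hypothetical helper for illustration, not part of Beautiful Soup.
    """
    if li is None:
        return None
    ems = li.find_all("em")
    if not ems:
        return None
    sibling = ems[-1].next_sibling
    # next_sibling may be None or another tag, so accept only plain strings
    if isinstance(sibling, NavigableString):
        return str(sibling).strip()
    return None


soup = BeautifulSoup(html, "html.parser")
result = text_after_last_em(soup.find("li", class_="print text"))
print(result)
```

The helper returns None instead of raising when the expected structure is not there, which is easier to handle when looping over many li elements.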

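You can use find(text=True, recursive=False) to achieve your goal. It returns the first string that is a direct child of the tag and does not descend into nested tags. (Since Beautiful Soup 4.4.0 the same argument is also available as string; text is the older name.)

Example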
from bs4 import BeautifulSoup
soup='''<li class="print text">
        <span><em class="time">
                <div class="time">1.29 s</div>
            </em><em class="status">passed</em>This is the text I want to get</span>'''

soup=BeautifulSoup(soup, 'html.parser')

# recursive=False limits the search to direct children of the span
print(soup.find('li',class_='print text').span.find(text=True, recursive=False))

Output

This is the text I want to get

If your li contains multiple span elements, you can use:

from bs4 import BeautifulSoup
soup='''<li class="print text">
        <span><em class="time">
                <div class="time">1.29 s</div>
            </em><em class="status">passed</em>This is the text I want to get</span>
            <span><em class="time">
                <div class="time">1.50 s</div>
            </em><em class="status">passed</em>This is the text I want to get too</span>'''

soup=BeautifulSoup(soup, 'html.parser')

for e in soup.select('li.print.text span'):
    print(e.find(text=True, recursive=False))

Output

This is the text I want to get
This is the text I want to get too
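One caveat: find returns only the first direct string, so if the real page is pretty-printed (whitespace between <span> and the first <em>), that first string may be whitespace only. A minimal sketch that collects every direct string of each span and drops the empty ones follows; the indented sample HTML and the helper name direct_text are assumptions for illustration, and it uses the string argument (Beautiful Soup 4.4.0 or later).

```python
from bs4 import BeautifulSoup

# Assumed variant of the sample: same tags, but pretty-printed with whitespace
html = '''<li class="print text">
    <span>
        <em class="time"><div class="time">1.29 s</div></em>
        <em class="status">passed</em>
        This is the text I want to get
    </span>
    <span>
        <em class="time"><div class="time">1.50 s</div></em>
        <em class="status">passed</em>
        This is the text I want to get too
    </span>
</li>'''


def direct_text(tag):
    """Join the non-empty strings that are direct children of tag.

    Hypothetical helper for illustration, not part of Beautiful Soup.
    """
    pieces = (s.strip() for s in tag.find_all(string=True, recursive=False))
    return " ".join(p for p in pieces if p)


soup = BeautifulSoup(html, "html.parser")
results = [direct_text(span) for span in soup.select("li.print.text span")]
for line in results:
    print(line)
```

Stripping each piece and skipping the empty ones keeps the result the same whether or not the markup is indented.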