Finding text in HTML using BeautifulSoup
I have the following HTML:
<li class="print text">
<span><em class="time">
<div class="time">1.29 s</div>
</em><em class="status">passed</em>This is the text I want to get</span>
I only need the text that sits outside all the other tags (that is: This is the text I want to get).
I am trying this code:
for el in doc.find_all('li', attrs={'class': 'print text'}):
    print(el.get_text())
Unfortunately, it prints everything, including the text inside the em tags.
Is there a way to do this?
Thanks!
Find the specific li tag by its class, call find_all on it to collect the em tags, take the last one from the list by index, and use next_sibling to get the text that follows it.
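from bs4 import BeautifulSoup

html = """<li class="print text">
<span><em class="time">
<div class="time">1.29 s</div>
</em><em class="status">passed</em>This is the text I want to get</span>"""
# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(html, "html.parser")
# The wanted text is the node right after the last <em> inside the <li>.
print(soup.find("li", class_="print text").find_all("em")[-1].next_sibling)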
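Note that next_sibling can be None (nothing follows the tag) or another tag rather than text when the markup differs from the sample. A minimal sketch that guards against those cases, assuming the built-in html.parser and a hypothetical helper named text_after_last_em (not part of BeautifulSoup):

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

html = """<li class="print text">
<span><em class="time">
<div class="time">1.29 s</div>
</em><em class="status">passed</em>This is the text I want to get</span></li>"""


def text_after_last_em(li):
    """Return the text node directly following the last <em> in `li`, or None.

    Hypothetical helper for illustration only.
    """
    ems = li.find_all("em")
    if not ems:
        return None
    sibling = ems[-1].next_sibling
    # next_sibling may be None or a Tag; only accept a plain text node.
    if isinstance(sibling, NavigableString):
        return str(sibling).strip()
    return None


soup = BeautifulSoup(html, "html.parser")
result = text_after_last_em(soup.find("li", class_="print text"))
print(result)
```

The isinstance check is a design choice: it returns None instead of raising or handing back a tag when the text is missing.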
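You can use find(text=True, recursive=False) to achieve this. It returns the first text node that is a direct child of the tag and ignores text nested inside child tags. (Since Beautiful Soup 4.4.0 the text argument is also available under the name string.)
Example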
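from bs4 import BeautifulSoup

html = '''<li class="print text">
<span><em class="time">
<div class="time">1.29 s</div>
</em><em class="status">passed</em>This is the text I want to get</span>'''
soup = BeautifulSoup(html, 'html.parser')
# string= is the current name for the older text= argument (Beautiful Soup 4.4.0+).
# recursive=False restricts the search to direct children of the <span>.
print(soup.find('li', class_='print text').span.find(string=True, recursive=False))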
Output
This is the text I want to get
If your li contains more than one span, you can use:
from bs4 import BeautifulSoup

html = '''<li class="print text">
<span><em class="time">
<div class="time">1.29 s</div>
</em><em class="status">passed</em>This is the text I want to get</span>
<span><em class="time">
<div class="time">1.50 s</div>
</em><em class="status">passed</em>This is the text I want to get too</span>'''
soup = BeautifulSoup(html, 'html.parser')
# Select every <span> under the <li> that has both classes, then print each span's own text.
for e in soup.select('li.print.text span'):
    print(e.find(string=True, recursive=False))
Output
This is the text I want to get
This is the text I want to get too
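If a span's own text is split into several pieces (for example, text both before and after a nested tag), find returns only the first piece. A minimal sketch that joins all direct text nodes with find_all(string=True, recursive=False), assuming Beautiful Soup 4.4.0 or later, made-up sample markup, and a hypothetical helper named own_text:

```python
from bs4 import BeautifulSoup

# Sample markup invented for this illustration: the first span has text
# on both sides of its <em> child.
html = """<li class="print text">
<span>Before <em class="status">passed</em> after</span>
<span><em class="status">failed</em>Only after</span></li>"""


def own_text(tag):
    """Join the text nodes that are direct children of `tag`.

    Hypothetical helper for illustration only.
    """
    pieces = tag.find_all(string=True, recursive=False)
    # Drop whitespace-only nodes and trim the rest before joining.
    return " ".join(p.strip() for p in pieces if p.strip())


soup = BeautifulSoup(html, "html.parser")
texts = [own_text(span) for span in soup.select("li.print.text span")]
print(texts)
```

Joining with a single space is an arbitrary choice here; keep the raw pieces if the original spacing matters.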