BeautifulSoup

Question

I have HTML like this:

<tr>
    <td>Title:</td>
    <td>Title value</td>
</tr>

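I need to get the text of the second <td> by specifying which <td> it comes after. Something like: grab the text of the first <td> that follows the <td> containing the text Title:. The result should be: Title value

I have some basic knowledge of Python and BeautifulSoup, but I have no idea how to do this when no class is specified.

I tried this: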

row =  soup.find_all('td', string='Title:')
text = str(row.nextSibling)
print(text)

I get the error: AttributeError: 'ResultSet' object has no attribute 'nextSibling'

Answer 1

What you are trying to do is somewhat easier with lxml and XPath. You could try something like this:

from lxml import etree

# For HTML input, pass parser=etree.HTMLParser() as well.
tree = etree.parse(<your file>)
path_list = tree.xpath('//<xpath to td>')
# Stop one short of the end so that path_list[i + 1] always exists.
for i in range(len(path_list) - 1):
    if path_list[i].text == '<What you want>':
        your_text = path_list[i + 1].text
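
As a more concrete variant of the idea above, XPath can select the following cell directly with the following-sibling axis, so no index loop is needed. This is only a sketch: it uses lxml.html rather than etree, and the markup string, the wrapping <table>, and the variable names are assumptions made for illustration.

```python
from lxml import html

# Sample markup from the question, wrapped in <table> so the parser keeps the row.
markup = """<table>
<tr>
    <td>Title:</td>
    <td>Title value</td>
</tr>
</table>"""

doc = html.fromstring(markup)

# Find the <td> whose trimmed text is "Title:" and take the text
# of the first <td> sibling that follows it.
values = doc.xpath('//td[normalize-space()="Title:"]/following-sibling::td[1]/text()')

print(values[0] if values else None)
```

xpath() returns a list here, so an empty list means no matching label cell was found.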

Answer 2

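First, soup.find_all() returns a ResultSet containing every element with the tag td and the string Title:. A ResultSet is a list of elements, not a single element, which is why accessing nextSibling on it raises the AttributeError.

You need to get the nextSibling of each element in the result set individually. You should also keep looping until you reach a nextSibling whose tag is td, because other nodes (such as a NavigableString holding the whitespace between the tags) can sit in between.

Example: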

>>> from bs4 import BeautifulSoup
>>> s="""<tr>
...     <td>Title:</td>
...     <td>Title value</td>
... </tr>"""
>>> soup = BeautifulSoup(s,'html.parser')
>>> row =  soup.find_all('td', string='Title:')
>>> for r in row:
...     nextSib = r.nextSibling
...     while nextSib is not None and nextSib.name != 'td':
...             nextSib = nextSib.nextSibling
...     if nextSib is not None:
...             print(nextSib.text)
...
Title value
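
The manual sibling loop above can also be written with find_next_sibling(), which skips the whitespace strings between tags for you. The following is a minimal sketch; the helper name value_after and its label parameter are hypothetical, chosen only for this example.

```python
from bs4 import BeautifulSoup

html = """<tr>
    <td>Title:</td>
    <td>Title value</td>
</tr>"""

soup = BeautifulSoup(html, "html.parser")


def value_after(soup, label):
    """Return the text of the first <td> after the <td> whose string is label."""
    # find() returns a single Tag (or None), unlike find_all(), which returns a ResultSet.
    cell = soup.find("td", string=label)
    if cell is None:
        return None
    # find_next_sibling("td") ignores NavigableString nodes between the cells.
    sibling = cell.find_next_sibling("td")
    return sibling.get_text(strip=True) if sibling is not None else None


result = value_after(soup, "Title:")
print(result)
```

Returning None when the label or the following cell is missing keeps the caller from hitting an AttributeError on incomplete rows.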

Alternatively, you can use another library that supports XPath, which makes this easy. Examples of such libraries are lxml and xml.etree.
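
For the standard-library option, here is a rough sketch with xml.etree.ElementTree. Two caveats apply: ElementTree only parses well-formed XML (the sample row happens to qualify, much real-world HTML does not), and its XPath subset has no following-sibling axis, so this version pairs up adjacent cells in Python instead. The helper name value_after is hypothetical.

```python
import xml.etree.ElementTree as ET

markup = """<tr>
    <td>Title:</td>
    <td>Title value</td>
</tr>"""

root = ET.fromstring(markup)


def value_after(root, label):
    """Return the text of the <td> that directly follows the <td> containing label."""
    # iter("tr") also yields root itself when root is a <tr>.
    for row in root.iter("tr"):
        cells = row.findall("td")
        # Walk adjacent pairs: (cell 0, cell 1), (cell 1, cell 2), ...
        for current, following in zip(cells, cells[1:]):
            if (current.text or "").strip() == label:
                return following.text
    return None


print(value_after(root, "Title:"))
```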

BeautifulSoup - How to extract text after specified string

python

extract

python-3.x