Using BeautifulSoup with Python to parse a page for attribute values

Question

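I'm using Python and BeautifulSoup to go through a page whose sections have ID values that increment by 1, and I'm trying to get the vids for each section. However, the number of vids varies from one span id to the next, as shown below, and the vids are not nested under the tr that holds the span.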

Right now I loop over the page to get the span id values, but I'm trying to figure out a way to get the vid values as an array for each span id.

Here is the sample HTML I'm working with:

<tr>
    <td>
        <div>
            <span class="apple-font" id="001">
        </div>
    </td>
</tr>

<tr>
</tr>

<tr>
    <td>
        <a vid="0099882"></a>
    </td>
</tr>

<tr>
    <td>
        <a vid="0099883"></a>
    </td>
</tr>

<tr>
    <td>
        <a vid="0099883"></a>
    </td>
</tr>


<tr>
    <td>
        <div>
            <span class="apple-font" id="002">
        </div>
    </td>
</tr>

<tr>
</tr>

<tr>
    <td>
        <a vid="0099883"></a>
    </td>
</tr>

<tr>
    <td>
        <div>
            <span class="apple-font" id="003">
        </div>
    </td>
</tr>

<tr>
</tr>

<tr>
    <td>
        <a vid="0099883"></a>
    </td>
</tr>

<tr>
    <td>
        <a vid="0099883"></a>
    </td>
</tr>

<tr>
    <td>
        <div>
            <span class="apple-font" id="004">
        </div>
    </td>
</tr>

<tr>
</tr>

Here is the code I've been using and experimenting with; so far I haven't made much progress on getting all the vids:

# Keep the parsed document in `soup`; store the matching spans separately.
spans = soup.find_all(class_="apple-font", id=True)
for s in spans:
    n = s.get_text().lstrip().replace(".", "")
    print(n)
print()

Answer 1

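I'd use an iterative approach: find the table that contains the first <span class="apple-font"> tag, loop over all tr elements in that table, and start a new group every time you find a row that contains a new id: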

table = soup.find(class_='apple-font', id=True).find_parent('table')
groups = {}
group = None
for tr in table.find_all('tr'):
    id_span = tr.find(class_='apple-font', id=True)
    if id_span is not None:
        # new group
        group = []
        groups[id_span['id']] = group
    else:
        vid_link = tr.find('a', vid=True)
        if vid_link is not None:
            group.append(vid_link['vid'])
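
The loop relies on group and the entry stored in groups being the same list object, so appending to group also fills the dictionary entry. A minimal sketch of just that mechanism, with made-up ids and vids typed in by hand rather than parsed from a page:

```python
groups = {}

# Start a group and register the same list object under its id.
group = []
groups['001'] = group

# Appending through `group` is visible through `groups` as well.
group.append('0099882')
group.append('0099883')

# Rebinding `group` starts a fresh list; the '001' entry is left untouched.
group = []
groups['002'] = group
group.append('0099883')

print(groups)
# {'001': ['0099882', '0099883'], '002': ['0099883']}
```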

Demo:

>>> from bs4 import BeautifulSoup
>>> sample = '''\
... <tr>
...     <td>
...         <div>
...             <span class="apple-font" id="001">
...         </div>
...     </td>
... </tr>
... 
... <tr>
... </tr>
... 
... <tr>
...     <td>
...         <a vid="0099882"></a>
...     </td>
... </tr>
... 
... <tr>
...     <td>
...         <a vid="0099883"></a>
...     </td>
... </tr>
... 
... <tr>
...     <td>
...         <a vid="0099883"></a>
...     </td>
... </tr>
... 
... 
... <tr>
...     <td>
...         <div>
...             <span class="apple-font" id="002">
...         </div>
...     </td>
... </tr>
... 
... <tr>
... </tr>
... 
... <tr>
...     <td>
...         <a vid="0099883"></a>
...     </td>
... </tr>
... 
... <tr>
...     <td>
...         <div>
...             <span class="apple-font" id="003">
...         </div>
...     </td>
... </tr>
... 
... <tr>
... </tr>
... 
... <tr>
...     <td>
...         <a vid="0099883"></a>
...     </td>
... </tr>
... 
... <tr>
...     <td>
...         <a vid="0099883"></a>
...     </td>
... </tr>
... 
... <tr>
...     <td>
...         <div>
...             <span class="apple-font" id="004">
...         </div>
...     </td>
... </tr>
... 
... <tr>
... </tr>
... '''
>>> soup = BeautifulSoup('<table>{}</table>'.format(sample), 'html.parser')
>>> table = soup.find(class_='apple-font', id=True).find_parent('table')
>>> groups = {}
>>> group = None
>>> for tr in table.find_all('tr'):
...     id_span = tr.find(class_='apple-font', id=True)
...     if id_span is not None:
...         # new group
...         group = []
...         groups[id_span['id']] = group
...     else:
...         vid_link = tr.find('a', vid=True)
...         if vid_link is not None:
...             group.append(vid_link['vid'])
... 
>>> print(groups)
{'001': ['0099882', '0099883', '0099883'], '002': ['0099883'], '003': ['0099883', '0099883'], '004': []}
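
One caveat with the loop above: if a row with a vid appears before the first row with an id, group is still None and group.append raises AttributeError. Below is a rough sketch of the same loop wrapped in a function that skips such rows. The function name group_vids_by_span_id and the trimmed-down markup (including the stray leading vid "0000000") are hypothetical, made up for illustration; the parser is set explicitly to Python's built-in 'html.parser'.

```python
from bs4 import BeautifulSoup


def group_vids_by_span_id(table):
    """Map each span id to the list of vid values in the rows that follow it."""
    groups = {}
    group = None  # no id row seen yet
    for tr in table.find_all('tr'):
        id_span = tr.find(class_='apple-font', id=True)
        if id_span is not None:
            # A row carrying an id starts a new group.
            group = []
            groups[id_span['id']] = group
        elif group is not None:
            # Collect the vid from this row, if it has one. Rows seen
            # before the first id row are skipped instead of raising.
            vid_link = tr.find('a', vid=True)
            if vid_link is not None:
                group.append(vid_link['vid'])
    return groups


# Hypothetical markup in the shape of the question's sample, with a stray
# vid row placed before the first id row to exercise the guard.
html = '''
<table>
<tr><td><a vid="0000000"></a></td></tr>
<tr><td><div><span class="apple-font" id="001"></span></div></td></tr>
<tr></tr>
<tr><td><a vid="0099882"></a></td></tr>
<tr><td><a vid="0099883"></a></td></tr>
<tr><td><div><span class="apple-font" id="002"></span></div></td></tr>
<tr><td><a vid="0099883"></a></td></tr>
<tr><td><div><span class="apple-font" id="003"></span></div></td></tr>
</table>
'''

soup = BeautifulSoup(html, 'html.parser')
groups = group_vids_by_span_id(soup.find('table'))
print(groups)
# {'001': ['0099882', '0099883'], '002': ['0099883'], '003': []}
```

An id with no vid rows after it still gets an entry with an empty list, which matches how '004' shows up in the demo.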
