How to parse HTML and get table ids using Python

Question

I want to parse HTML and get a list of table IDs using Python.

The page I am trying to scrape for the table IDs is https://docs.aws.amazon.com/workspaces/latest/adminguide/workspaces-port-requirements.html

The HTML document has the following format and contains multiple tables:

<html>
<div class="table-container">
  <div class="table-contents disable-scroll">
     <table id="w345aab9c13c11b5"> <!-- this is the table id for the table name below -->
        <thead>
           <tr>
              <th class="table-header" colspan="100%">
                 <div class="title">Domains and IP addresses to add to your allow list</div> <!-- I need to look for this table name and get the table id associated with it -->
              </th>
           </tr>
        </thead>
        <tbody>
          ...
        </tbody>
    </table>
  </div>
</div>
<div class="table-container">
  <div class="table-contents disable-scroll">
     <table id="w345aab9c13c13b2">
        <thead>
           <tr>
              <th class="table-header" colspan="100%">
                 <div class="title">Domains and IP Addresses to Add to Your Allow List for PCoIP</div>
              </th>
           </tr>
           <tr>
          ...
           </tr>
        </thead>
        <tbody>
          ...
        </tbody>
    </table>
  </div>
</div>
...
</html>

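I need to find the div whose text matches a given table name and get the id of the table that contains it.

I am new to Python, so any suggestions on how to approach or solve this would be very helpful.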

Answer 1

You can use BeautifulSoup to get the IDs:

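import requests
from bs4 import BeautifulSoup

url = 'https://docs.aws.amazon.com/workspaces/latest/adminguide/workspaces-port-requirements.html'

resp = requests.get(url)
resp.raise_for_status()  # stop early on an HTTP error

soup = BeautifulSoup(resp.content, 'html.parser')

# Compare case-insensitively: the table names in the question differ in capitalisation.
wanted = 'domains and ip addresses to add to your allow list'

# Only tables that carry an id attribute are considered.
for t in soup.select('table[id]'):
    if wanted in t.get_text().lower():
        print(t.attrs['id'])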

I'm sure you can figure out how to incorporate this into your code.
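
The loop above tests the full text of each table, so a name that is a prefix of another name (as with the two tables in the question) selects both. If you want an exact match on the title div only, one option is sketched below. Note that it swaps BeautifulSoup for the standard library's html.parser so it runs without extra packages. TableTitleParser, find_table_id and sample are illustrative names, not part of any library, and the sketch assumes each title div holds plain text with no nested div, as in the markup from the question.

```python
from html.parser import HTMLParser


class TableTitleParser(HTMLParser):
    """Collect {title text: table id} for each <div class="title"> inside a <table id=...>."""

    def __init__(self):
        super().__init__()
        self.ids_by_title = {}
        self._table_ids = []      # stack of ids for the tables currently open
        self._title_parts = None  # text fragments while inside a title div

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table":
            self._table_ids.append(attrs.get("id"))
        elif tag == "div" and "title" in (attrs.get("class") or "").split():
            self._title_parts = []

    def handle_data(self, data):
        if self._title_parts is not None:
            self._title_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self._title_parts is not None:
            # Assumption: no nested <div> inside the title div.
            title = " ".join("".join(self._title_parts).split())
            if self._table_ids and self._table_ids[-1]:
                self.ids_by_title[title] = self._table_ids[-1]
            self._title_parts = None
        elif tag == "table" and self._table_ids:
            self._table_ids.pop()


def find_table_id(html, wanted_title):
    """Return the id of the table whose title equals wanted_title (ignoring case), else None."""
    parser = TableTitleParser()
    parser.feed(html)
    wanted = wanted_title.casefold()
    for title, table_id in parser.ids_by_title.items():
        if title.casefold() == wanted:
            return table_id
    return None


# Trimmed copy of the markup from the question.
sample = """
<div class="table-container">
  <table id="w345aab9c13c11b5">
    <thead><tr><th class="table-header">
      <div class="title">Domains and IP addresses to add to your allow list</div>
    </th></tr></thead>
  </table>
</div>
<div class="table-container">
  <table id="w345aab9c13c13b2">
    <thead><tr><th class="table-header">
      <div class="title">Domains and IP Addresses to Add to Your Allow List for PCoIP</div>
    </th></tr></thead>
  </table>
</div>
"""

print(find_table_id(sample, "Domains and IP addresses to add to your allow list"))
# prints: w345aab9c13c11b5
```

To run this against the live page, pass resp.text from the requests call above in place of sample. The live markup may differ from the excerpt in the question, so check the result before relying on it.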

