Convert an HTML table to a dict with BeautifulSoup or lxml?

Question

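I'm trying to convert several HTML tables into dictionaries, but I can't get it to work. The data is shown below. The 'Running' column has a different number of links in each row.

I only care about the Title, Name, and Running columns.

My end goal is a list of dictionaries, one per table row. I've been struggling with this for a while and haven't gotten anywhere.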

[{'Title': 'Randomnamehere1', 'Name': 'Bob Dylan1', 'Running': [href, href, href]},
 {'Title': 'Randomnamehere2', 'Name': 'Bob Dylan2', 'Running': [href, href, href]},
 {'Title': 'Randomnamehere3', 'Name': 'Bob Dylan3', 'Running': [href, href, href]}]

    <div class="span12">
      <table id="tests" class="responsive table pdf-table drop-row tablesorter" style="width:100%;margin-bottom:0;">
        <thead>
          <th>Title</th>
          <th>Name</th>
          <th>Group</th>
          <th>Time</th>
          <th>Running</th>
          <th>Instructor Actions</th>
        </thead>
        <tbody>
          <tr>
            <td>
              <a href="/reports/532809">Randomnamehere</a>
            </td>
            <td>
              Bob Dylan (bd@letsgo.com)
            </td>
            <td>
              Group1
            </td>
            <td>
              01:54:20
            </td>
            <td>
              <ul style="list-style-type:none">
                PWS(s)
                  <li>
                  <a href="/user_section_items/532809/" target="_blank">local</a>
                  </li>
                  Mod_X010_C008
                    <li>
                    <a href="/user_section_items/532809/" target="_blank">lab:SC</a>
                    </li>
                    <li>
                    <a href="/user_section_items/532809/" target="_blank">NIX</a>
                    </li>
              </ul>
            </td>

This is what I've got so far...

from lxml import html, etree
from bs4 import BeautifulSoup as bs

source = html.parse('source.html')
table = [c.text for c in source.xpath('//div[@class="span12"]//tbody//td/*//*')]

Answer 1

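Loop over the table rows, ignoring the header row, and build each dictionary inside the loop. Append each one to a module-level list to get the result you want. You can use :nth-of-type to tell the columns apart. For the first column, select_one('td') is enough, because it returns the first matching td. A list comprehension can then pull the href attributes out of the Running column.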

from bs4 import BeautifulSoup as bs

html = '''your html'''
soup = bs(html, 'lxml')
results = []

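# Select only the body rows, so the header is skipped no matter how the
# parser treats the <thead> (the posted markup has no <tr> inside it).
for row in soup.select('#tests tbody tr'):
    results.append({
        # The first <td> in the row is the Title column.
        'Title': row.select_one('td').get_text(strip=True),
        'Name': row.select_one('td:nth-of-type(2)').get_text(strip=True),
        # Collect every link in the fifth (Running) column.
        'Running': [a['href'] for a in row.select('td:nth-of-type(5) a')],
    })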

print(results)
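Since the question also mentions lxml, here is a rough sketch of the same idea using XPath instead of CSS selectors. It is only an illustration under a few assumptions: the SAMPLE markup is a trimmed-down, hand-closed version of the table in the question (the original snippet is cut off), and table_to_dicts is a made-up helper name, not part of lxml.

```python
from lxml import html

# Assumption: a shortened copy of the question's table with one row,
# with the closing tags added by hand so the snippet is complete.
SAMPLE = """
<div class="span12">
  <table id="tests">
    <thead>
      <tr><th>Title</th><th>Name</th><th>Group</th><th>Time</th><th>Running</th></tr>
    </thead>
    <tbody>
      <tr>
        <td><a href="/reports/532809">Randomnamehere</a></td>
        <td> Bob Dylan (bd@letsgo.com) </td>
        <td>Group1</td>
        <td>01:54:20</td>
        <td>
          <ul>
            <li><a href="/user_section_items/532809/">local</a></li>
            <li><a href="/user_section_items/532809/">NIX</a></li>
          </ul>
        </td>
      </tr>
    </tbody>
  </table>
</div>
"""


def table_to_dicts(markup):
    """Hypothetical helper: one dict per body row (Title, Name, Running)."""
    tree = html.fromstring(markup)
    results = []
    # Only rows inside <tbody> are visited, so the header is never included.
    for row in tree.xpath('//table[@id="tests"]/tbody/tr'):
        cells = row.xpath('./td')
        results.append({
            'Title': cells[0].text_content().strip(),
            'Name': cells[1].text_content().strip(),
            # The fifth cell is the Running column; keep each link's href.
            'Running': [str(href) for href in cells[4].xpath('.//a/@href')],
        })
    return results


rows = table_to_dicts(SAMPLE)
print(rows)
```

Indexing the td cells by position plays the same role as :nth-of-type in the BeautifulSoup version, and the Running list simply grows or shrinks with the number of links in each row.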

html

python

lxml

beautifulsoup

web-scraping