How do I parse this HTML with Python lxml and XPath to find the parent table of a specific span id?

Question

This is HTML that I have no control over. It is a condensed version of the real page's HTML.

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Little League</title>
</head>
<body>
<table>
<span>lot of unrelated text</span>
</table>
<table>
<span>lot of unrelated text</span>
</table>
<table>
<span>lot of unrelated text</span>
</table>
<table>
<tbody>
<tr>
<td class="rightTD">
<p>
<span id="teams_players">Player Teams</span>
</p>
</td>
</tr>
<tr>
<td>
<table border="1" cellspacing="0" cellpadding="0" class="tableBorder table table-bordered" width="100%">
<tbody>
<tr>
<td>
<table border="0" width="100%" class="tableData">
<tbody>
<tr id="team_listings">
<td colspan="3">Team Listings
<br>
<br>
</td>
</tr>
<tr>
<td>(a) </td>
<td colspan="2">Team Name </td>
</tr>
<tr>
<td></td>
<td colspan="2">
<span class="blue_color">Foxes</span>
</td>
</tr>
<tr>
<td>(b) </td>
<td colspan="2">Team Rank</td>
</tr>
<tr>
<td></td>
<td colspan="2">
<span class="blue_color">1</span>
</td>
</tr>
<tr>
<td>(c) </td>
<td colspan="2">Team Location
</td>
</tr>
<tr>
<td></td>
<td colspan="2">
<table width="100%">
<tbody>
<tr>
<td>City:
<br>
<span class="blue_color">Tualatin</span>
</td>
<td>State:
<br>
<span class="blue_colorLined"></span>
<br>
<span class="blue_color">Oregon</span>
</td>
<td>Country:
<br>
<span class="blue_colorLined"></span>
<br>
<span class="blue_color">United States</span>
</td>
</tr>
</tbody>
</table>
</td>
</tr>

</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<br>
<table border="1" cellspacing="0" cellpadding="0" class="tableBorder table table-bordered" width="100%">
<tbody>
<tr>
<td>
<table border="0" width="100%" class="tableData">
<tbody>
<tr>
<td>(a) </td>
<td colspan="2">Team Name </td>
</tr>
<tr>
<td></td>
<td colspan="2">
<span class="blue_color">Tigers</span>
</td>
</tr>
<tr>
<td>(b) </td>
<td colspan="2">Team Rank</td>
</tr>
<tr>
<td></td>
<td colspan="2">
<span class="blue_color">3</span>
</td>
</tr>
<tr>
<td>(c) </td>
<td colspan="2">Team Location
</td>
</tr>
<tr>
<td></td>
<td colspan="2">
<table width="100%">
<tbody>
<tr>
<td>City:
<br>
<span class="blue_color">Tigard</span>
</td>
<td>State:
<br>
<span class="blue_colorLined"></span>
<br>
<span class="blue_color">Oregon</span>
</td>
<td>Country:
<br>
<span class="blue_colorLined"></span>
<br>
<span class="blue_color">United States</span>
</td>
</tr>
</tbody>
</table>
</td>
</tr>

</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</body>
</html>

I'm trying to access the table tag that is the closest ancestor of the span tag with id teams_players.

I tried these, but they failed:

//table/span[@id="teams_players"]
ancestor::table[span[@id="teams_players"][position() = 1]]

This works, but it isn't elegant, and I'd rather not hard-code the depth:

//span[@id="teams_players"]/../../../../..

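//table[@class="tableData"] looks like it should work, but the full HTML has many other tables with the same class that hold unrelated data, so that is ruled out.

Here is the code I've tried so far (it is definitely not efficient; once I find a way to get the two tables, I plan to loop over them to extract the data):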

from io import StringIO

from lxml import etree


def parse_team():

    # team data structure
    teams = []
    team_dict = { 'team': '', 'rank': '', 'location': { 'city': '', 'state': '', 'country': '' } }

    filename = 'team.html'
    # Read the file inside a context manager so the handle is closed promptly.
    with open(filename, encoding="utf8") as fh:
        html = fh.read()
    parser = etree.HTMLParser()
    tree = etree.parse(StringIO(html), parser)

    # fetch the table dom and parse each team table
    # fetch the parent table that contains teams_players span id
    team_tables = tree.xpath('ancestor::table[span[@id="teams_players"][position() = 1]]')
    print(team_tables)

    root_tables = tree.xpath('//table/span[@id="teams_players"]')
    print("root tables", root_tables)

    # this provides each team table but in full html, the same class is being used for other unrelated data
    name = tree.xpath('//table[@class="tableData"]')
    print(name)

    eachvaltr = name[0].xpath('.//tr')
    teamname = name[0].xpath('.//td[contains(text(),"Team Name")]//parent::tr/following-sibling::tr[1]//span[@class="blue_color"]/text()')
    print("teamname", teamname)
    teamrank = name[0].xpath(
        './/td[contains(text(),"Team Rank")]//parent::tr/following-sibling::tr[1]//span[@class="blue_color"]/text()')
    print("teamrank", teamrank)
    city = name[0].xpath(
        './/td[contains(text(),"City")]//span[@class="blue_color"]/text()')
    state = name[0].xpath(
        './/td[contains(text(),"State")]//span[@class="blue_color"]/text()')
    country = name[0].xpath(
        './/td[contains(text(),"Country")]//span[@class="blue_color"]/text()')
    print(city[0], state[0], country[0])
    team_dict['team'] = teamname
    team_dict['rank'] = teamrank
    team_dict['location']['city'] = city[0]
    team_dict['location']['state'] = state[0]
    team_dict['location']['country'] = country[0]

    print(team_dict)

The desired output is a list of teams, where each team is a dictionary.

[{'team': ['Foxes'], 'rank': ['1'], 'location': {'city': 'Tualatin', 'state': 'Oregon', 'country': 'United States'}}]

Answer 1

//table[.//span[@id="teams_players"]]

or

//span[@id="teams_players"]/ancestor::table
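
Both expressions work because they look for the span anywhere below the table rather than directly under it. The attempts in the question use table/span or table[span[...]], which only match a span that is an immediate child of the table, and a bare ancestor:: step evaluated from the top of the document has no ancestors to return. If the span were nested inside several tables, ancestor::table[1] would select the nearest one. The following is a minimal sketch, not a drop-in solution: the HTML is a shortened, hypothetical stand-in for the page in the question, the helper first_text is an illustrative name, and each team is stored as a plain string rather than the one-element lists shown in the desired output.

```python
from io import StringIO

from lxml import etree

# Shortened, hypothetical stand-in for the page in the question.
HTML = """
<html><body>
<table><tr><td><span>unrelated</span></td></tr></table>
<table>
  <tr><td class="rightTD"><p><span id="teams_players">Player Teams</span></p></td></tr>
  <tr><td>
    <table class="tableData">
      <tr><td>(a) </td><td colspan="2">Team Name </td></tr>
      <tr><td></td><td colspan="2"><span class="blue_color">Foxes</span></td></tr>
      <tr><td>(b) </td><td colspan="2">Team Rank</td></tr>
      <tr><td></td><td colspan="2"><span class="blue_color">1</span></td></tr>
    </table>
    <table class="tableData">
      <tr><td>(a) </td><td colspan="2">Team Name </td></tr>
      <tr><td></td><td colspan="2"><span class="blue_color">Tigers</span></td></tr>
      <tr><td>(b) </td><td colspan="2">Team Rank</td></tr>
      <tr><td></td><td colspan="2"><span class="blue_color">3</span></td></tr>
    </table>
  </td></tr>
</table>
<table class="tableData"><tr><td>unrelated data</td></tr></table>
</body></html>
"""

tree = etree.parse(StringIO(HTML), etree.HTMLParser())

# Two equivalent ways to reach the table that encloses the span.
by_predicate = tree.xpath('//table[.//span[@id="teams_players"]]')
by_axis = tree.xpath('//span[@id="teams_players"]/ancestor::table')
parent_table = by_axis[0]


def first_text(node, label):
    """Return the blue_color value in the row after the cell labelled `label`."""
    hits = node.xpath(
        './/td[contains(text(), $label)]/parent::tr'
        '/following-sibling::tr[1]//span[@class="blue_color"]/text()',
        label=label,  # lxml passes keyword arguments as XPath variables
    )
    return hits[0] if hits else ''


# Search only inside the parent table, so the unrelated tableData
# table elsewhere on the page is never visited.
teams = []
for team_table in parent_table.xpath('.//table[@class="tableData"]'):
    teams.append({
        'team': first_text(team_table, 'Team Name'),
        'rank': first_text(team_table, 'Team Rank'),
    })

print(teams)
```

On this sample the loop should collect two dictionaries, one for Foxes and one for Tigers. The location fields can be pulled the same way with the City, State, and Country expressions from the question, run against team_table instead of name[0].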
