额外的 HTML 标签导致 bs4 出现问题

Question

我正在尝试从站点 http://www.house.gov/representatives/ 上的 table 获取一些信息具体来说，我想获得 "Representative Directory By Last Name" table 代表的信息。到目前为止，我可以从站点下载 HTML 并将其写入文件，但是当使用 bs4 解析和抓取我想要的特定 tables 时，它只抓取第一行每个 table.

这是因为 HTML table:

的每一行都有一个额外的标签

<tr>
<td><a href="https://abraham.house.gov/">
Abraham, Ralph  </a>
</td>
<td>Louisiana 5th District</td>
<td>R</td>
<td>417 CHOB</td>
<td>202-225-8490</td>
<td>Agriculture<BR>Armed Services<BR>Science, Space, and Technology</td>
</td>
</tr>

最后一个 /td 标记以某种方式导致 bs4 无法抓取其余行。我确实进行了手动测试并删除了一些额外的标签，然后我取回了所有行，所以我知道额外的标签是问题所在。到目前为止，这是我的 python 代码：

import bs4, requests

res = requests.get('http://www.house.gov/representatives/')
res.raise_for_status()
file = open('HouseReps.html', 'wb')
for chunk in res.iter_content(100000):
    file.write(chunk)
file = open('HouseReps.html')
soup = bs4.BeautifulSoup(file, 'html.parser')
table = soup.select('table[title="Representative Directory By Last Name"]')
print(table)

我也尝试过使用 prettify() 但这也没有帮助。关于如何清理 HTML 以便我可以使用 bs4（或其他东西）解析和提取我需要的 table 的任何想法？

谢谢！

Answer 1

您可以在代码中使用 lxml 解析器而不是 html.parser :

import bs4, requests

res = requests.get('http://www.house.gov/representatives/')
res.raise_for_status()
file = open('HouseReps.html', 'wb')
for chunk in res.iter_content(100000):
    file.write(chunk)
file = open('HouseReps.html')
soup = bs4.BeautifulSoup(file, 'lxml') #use the `lxml` parser instead of `html.parser`
table = soup.findAll("table",{"title":"Representative Directory By Last Name"})
print(table[0]) #print first table

输出将显示完整的第一个 table 和 "title" = "Representative Directory By Last Name":

<table class="directory" title="Representative Directory By Last Name">
<colgroup>
<col class="name"></col>
<col class="dist2"></col>
<col class="part"></col>
<col class="room"></col>
<col class="phone2"></col>
<col class="comm2"></col>
</colgroup>
<thead>
<tr>
<th>Name</th>
<th>District</th>
<th>Party</th>
<th>Room</th>
<th>Phone</th>
<th>Committee Assignment</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://abraham.house.gov/">
Abraham, Ralph  </a>
</td>
<td>Louisiana 5th District</td>
<td>R</td>
<td>417 CHOB</td>
<td>202-225-8490</td>
<td>Agriculture<br/>Armed Services<br/>Science, Space, and Technology</td>
</tr>
<tr>
<td><a href="http://adams.house.gov">
Adams, Alma </a>
</td>
<td>North Carolina 12th District</td>
<td>D</td>
<td>222 CHOB</td>
<td>202-225-1510</td>
<td>Agriculture<br/>Education and the Workforce<br/>Small Business</td>
</tr>
<tr>
<td><a href="https://aderholt.house.gov/">
Aderholt, Robert </a>
</td>
<td>Alabama 4th District</td>
<td>R</td>
<td>235 CHOB</td>
<td>202-225-4876</td>
<td>Appropriations</td>
</tr>
<tr>
<td><a href="https://aguilar.house.gov/">
Aguilar, Pete </a>
</td>
<td>California 31st District</td>
<td>D</td>
<td>1223 LHOB</td>
<td>202-225-3201</td>
<td>Appropriations</td>
</tr>
<tr>
<td><a href="http://allen.house.gov">
Allen, Rick </a>
</td>
<td>Georgia 12th District</td>
<td>R</td>
<td>426 CHOB</td>
<td>202-225-2823</td>
<td>Agriculture<br/>Education and the Workforce</td>
</tr>
<tr>
<td><a href="https://amash.house.gov/">
Amash, Justin </a>
</td>
<td>Michigan 3rd District</td>
<td>R</td>
<td>114 CHOB</td>
<td>202-225-3831</td>
<td>Oversight and Government</td>
</tr>
<tr>
<td><a href="https://amodei.house.gov">
Amodei, Mark </a>
</td>
<td>Nevada 2nd District</td>
<td>R</td>
<td>332 CHOB</td>
<td>202-225-6155</td>
<td>Appropriations</td>
</tr>
<tr>
<td><a href="https://arrington.house.gov">
Arrington, Jodey  </a>
</td>
<td>Texas 19th District</td>
<td>R</td>
<td>1029 LHOB</td>
<td>202-225-4005</td>
<td>Agriculture<br/>the Budget<br/>Veterans' Affairs</td>
</tr>
</tbody>
</table>

额外的 HTML 标签导致 bs4 出现问题

Extra HTML tag causing problems with bs4

html

python

bs4