Parsing Help in Python

Question

Can anyone help me with some parsing? I'm having a lot of trouble. I'm parsing information from this site.

Here are a few lines of code that pull the data from a table, which contains 2 headers and 4 values:

for x in soup.findAll(attrs={'valign':'top'}):
    print(x.contents)
    make_list = x.contents
    print(make_list[1])  # trying to select one of the values on the list.

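When I try to print it with the make_list[1] line, I get an error. However, if I take out the last two lines, I get the HTML I want in list form, but I can't seem to separate or filter the items (strip out the HTML tags). Can anyone help?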

Here is a sample of the output, which I want to break down further. I'm not sure what the right regular expression would be:

 ['\n', <td align="left">Western Mutual/Residence <a href="http://interactive.web.insurance.ca.gov/companyprofile/companyprofile?event=companyProfile&amp;doFunction=getCompanyProfile&amp;eid=3303"><small>(Info)</small></a></td>, '\n', <td align="left"><div align="right">           355</div></td>, '\n', <td align="left"><div align="right">250</div></td>, '\n', <td align="left"> </td>, '\n', <td align="left">Western Mutual/Residence <a href="http://interactive.web.insurance.ca.gov/companyprofile/companyprofile?event=companyProfile&amp;doFunction=getCompanyProfile&amp;eid=3303"><small>(Info)</small></a></td>, '\n', <td align="left"><div align="right">           320</div></td>, '\n', <td align="left"><div align="right">500</div></td>, '\n']

Answer 1

If you are trying to parse the results from that site, the following should work:

from bs4 import BeautifulSoup

html_doc = "....add your html...."  # placeholder: put the page's HTML here
soup = BeautifulSoup(html_doc, 'html.parser')
rows = []
tables = soup.find_all('table')
t2 = None

# Find the second from last table
for t3 in tables:
    t1, t2 = t2, t3

for row in t1.find_all('tr'):
    cols = row.find_all(['td', 'th'])
    cols = [col.text.strip() for col in cols]
    rows.append(cols)

# Collate the two columns
data = [cols[0:3] for cols in rows]
data.extend([cols[4:7] for cols in rows[1:]])

for row in data:
    print("{:40}  {:15} {}".format(row[0], row[1], row[2]))

This gives me output that looks like:

Company Name                              Annual Premium  Deductible
AAA (Interinsurance Exchange) (Info)      N/A             250
Allstate (Info)                           315             250
American Modern (Info)                    N/A             250
Amica Mutual (Info)                       259             250
Bankers Standard (Info)                   N/A             250
California Capital  (Info)                160             250
Century National (Info)                   N/A             250
.....

How does it work?

The page mainly displays a table, and that table is what we are after, so the first step is to get a list of the tables.

Several parts of the site use tables. The page structure is likely to stay the same, at least between requests.

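The table we need is near the end of the page (second from last, not the very last), so I decided to iterate over the available tables and pick the second from last. t1, t2 and t3 are just a workaround for keeping hold of the previous values while iterating.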
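To make the t1, t2, t3 rotation concrete, here is a minimal sketch that uses a plain list of strings as a stand-in for the result of soup.find_all('table'). The list contents are hypothetical placeholders, not the page's real tables. It also shows that negative indexing appears to reach the same item whenever there are at least two tables.

```python
# Stand-in for soup.find_all('table'); these names are hypothetical placeholders.
tables = ["layout", "navigation", "results", "footer"]

t1 = t2 = None
for t3 in tables:
    # After each step, t2 holds the current item and t1 the one before it.
    t1, t2 = t2, t3

print(t1)          # the second-from-last item
print(tables[-2])  # the same item, reached by negative indexing
```

Note that tables[-2] raises an IndexError when the list has fewer than two items, whereas the loop leaves t1 as None in that case, so which one fits better depends on how you want a missing table to be reported.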

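From here, an HTML table usually has a fairly standard structure of TR and TD elements. This one also uses TH for the header row. With the table in hand, BeautifulSoup lets us enumerate all the rows.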

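For each row, we can find all the columns. If you print what is returned, you will see every entry in each row, and from that you can work out which indexes you need to slice out.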
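As a rough illustration of the row and column steps above, here is a minimal sketch run against one row adapted from the sample output in the question. The surrounding table and tr tags and the shortened href values are assumptions made for the example, not copied from the site.

```python
from bs4 import BeautifulSoup

# One row adapted from the question's sample output.
# The <table>/<tr> wrapper and the "#" links are assumptions for illustration.
html_doc = """
<table>
<tr>
<td align="left">Western Mutual/Residence <a href="#"><small>(Info)</small></a></td>
<td align="left"><div align="right">           355</div></td>
<td align="left"><div align="right">250</div></td>
<td align="left"> </td>
<td align="left">Western Mutual/Residence <a href="#"><small>(Info)</small></a></td>
<td align="left"><div align="right">           320</div></td>
<td align="left"><div align="right">500</div></td>
</tr>
</table>
"""

soup = BeautifulSoup(html_doc, "html.parser")

for row in soup.find_all("tr"):
    # .text drops the tags; .strip() removes the padding around each value.
    cells = [cell.text.strip() for cell in row.find_all(["td", "th"])]
    # Print each index next to its value to see which slices are needed.
    for index, value in enumerate(cells):
        print(index, repr(value))
```

Printing the index beside each value makes the blank spacer column easy to spot, which is what motivates slicing the row into two groups. No regular expression is needed to strip the tags in this sketch.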

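They display the output in two column groups with a blank column in between. I built two lists to extract the two groups of columns, then appended the second group to the bottom of the first for display.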
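The collation step can be tried on its own with plain lists. The sketch below assumes each parsed row has seven entries (three columns, a blank spacer, three more columns) and that the header row repeats its labels in the same layout. The data row uses the values from the question's sample output, and the header layout is an assumption.

```python
# Rows as they might look after parsing; the seven-column layout is assumed.
rows = [
    ["Company Name", "Annual Premium", "Deductible", "",
     "Company Name", "Annual Premium", "Deductible"],
    ["Western Mutual/Residence (Info)", "355", "250", "",
     "Western Mutual/Residence (Info)", "320", "500"],
]

# Left-hand group, including the header row.
data = [cols[0:3] for cols in rows]
# Right-hand group, skipping the repeated header, appended underneath.
data.extend(cols[4:7] for cols in rows[1:])

for name, premium, deductible in data:
    print("{:40}  {:15} {}".format(name, premium, deductible))
```

Index 3 is the blank spacer, which is why the second slice starts at 4.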
