Extracting information from a table, except the table's header, using bs4

Question

I am trying to extract information from a table using bs4 and Python. When I use the following code to extract information from the table's header:

    tr_header=table.findAll("tr")[0]
    tds_in_header = [td.get_text()  for td in tr_header.findAll("td")]
    header_items= [data.encode('utf-8')  for data in tds_in_header]
    len_table_header = len (header_items)

it works. But with the following code, where I try to extract the information from the first row to the end of the table:

    tr_all=table.findAll("tr")[1:]
    tds_all = [td.get_text()  for td in tr_all.findAll("td")]
    table_info= [data.encode('utf-8')  for data in tds_all]

I get the following error:

    AttributeError: 'list' object has no attribute 'findAll'

Could anyone help me fix this?

Here is the table:

    <table class="codes"><tr><td><b>Code</b>
    </td><td><b>Display</b></td><td><b>Definition</b></td>
    </tr><tr><td>active<a name="active"> </a></td>
    <td>Active</td><td>This account is active and may be used.</td></tr>
    <tr><td>inactive<a name="inactive"> </a></td>
    <td>Inactive</td><td>This account is inactive
     and should not be used to track financial information.</td></tr></table>

This is the output of tr_all:

    [<tr><td><b>Code</b></td><td><b>Display</b></td><td><b>Definition</b></td></tr>, <tr><td>active<a name="active"> </a></td><td>Active</td><td>This account is active and may be used.</td></tr>, <tr><td>inactive<a name="inactive"> </a></td><td>Inactive</td><td>This account is inactive and should not be used to track financial information.</td></tr>]

Answer 1

For your first question:

    import bs4

    text = """
    <table class="codes"><tr><td><b>Code</b>
    </td><td><b>Display</b></td><td><b>Definition</b></td>
    </tr><tr><td>active<a name="active"> </a></td>
    <td>Active</td><td>This account is active and may be used.</td></tr>
    <tr><td>inactive<a name="inactive"> </a></td>
    <td>Inactive</td><td>This account is inactive
     and should not be used to track financial information.</td></tr></table>"""

    # name the parser explicitly to avoid the "no parser specified" warning
    table = bs4.BeautifulSoup(text, "html.parser")
    tr_all = table.findAll("tr")[1:]  # a list of rows, so loop over it
    tds_all = []
    for tr in tr_all:
        # call findAll on each row, not on the list of rows
        tds_all.append([td.get_text() for td in tr.findAll("td")])
    # a double list comprehension, if you prefer that to a second loop;
    # note that range(len(tds_all)) counts the rows (2), not the cells
    # in a row (3), so only the first two cells of each row are kept
    table_info = [data[i].encode('utf-8') for data in tds_all
                                          for i in range(len(tds_all))]
    print(table_info)

Output:

    ['active ', 'Active', 'inactive ', 'Inactive']

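Regarding your second question:

"tr_header=table.findAll("tr")[0] i do not get a list"

Yes. [0] is an indexing operation: it selects the first element of the list, so you get a single element back. [1:] is the slice operator, which returns a list (have a look at a good tutorial on slicing if you need more information).

In fact, you build the list twice, once for each call to table.findAll("tr"): once for the header and once for the remaining rows. That is redundant. If you want to separate the header cells from the rest, I think you want something like this: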
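To make the difference concrete, here is a minimal sketch that uses a plain Python list as a stand-in for the result of table.findAll("tr"). The strings are placeholders chosen for illustration, not real bs4 objects.

```python
# A plain list stands in for the result of table.findAll("tr").
rows = ["header row", "first data row", "second data row"]

first = rows[0]   # indexing returns one element of the list
rest = rows[1:]   # slicing returns a new list, even if it is short

print(type(first).__name__)
print(type(rest).__name__)

# A list has no findAll method, which is what the AttributeError reports.
print(hasattr(rest, "findAll"))
```

With real bs4 results the same rule applies: the single row from [0] is a tag that has findAll, while the slice from [1:] is a list and has to be looped over.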

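    tr_all = table.findAll("tr")  # query the table only once
    header = tr_all[0]            # a single row
    tr_rest = tr_all[1:]          # a list of the remaining rows

    # use findAll("td") so that only the cells are visited, not the
    # whitespace strings that sit between the tags
    header_data = [td.get_text().encode('utf-8') for td in header.findAll("td")]

    tds_rest = []
    for tr in tr_rest:
        tds_rest.append([td.get_text() for td in tr.findAll("td")])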
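Once the header and the remaining rows are separated, one possible next step is to pair each row's cells with the header labels. The following is only a sketch under a few assumptions: it runs on Python 3, it uses find_all (the current spelling of findAll), and the helper name cell_texts and the whitespace clean-up are illustrative choices rather than part of the original code.

```python
import bs4

html = """
<table class="codes"><tr><td><b>Code</b>
</td><td><b>Display</b></td><td><b>Definition</b></td>
</tr><tr><td>active<a name="active"> </a></td>
<td>Active</td><td>This account is active and may be used.</td></tr>
<tr><td>inactive<a name="inactive"> </a></td>
<td>Inactive</td><td>This account is inactive
 and should not be used to track financial information.</td></tr></table>"""

soup = bs4.BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr")


def cell_texts(tr):
    """Hypothetical helper: text of every <td> in a row, whitespace collapsed."""
    return [" ".join(td.get_text().split()) for td in tr.find_all("td")]


header = cell_texts(rows[0])
# One dict per data row, keyed by the header labels.
records = [dict(zip(header, cell_texts(tr))) for tr in rows[1:]]

print(header)
for record in records:
    print(record)
```

Collapsing whitespace with split and join removes the trailing space left by the empty anchor tag and the line break inside the last cell. Skip that step if the raw text is needed.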

Regarding the third question:

"Is it possible to edit this code to add table information from the first row to the end of the table?"

Given the desired output you described in the comments below:

    rows_all = table.findAll("tr")
    header = rows_all[0]
    rows = rows_all[1:]

    data = []
    for row in rows:
        # visit only the <td> cells; iterating over the row itself would
        # also yield the whitespace strings between the tags
        for td in row.findAll("td"):
            data.append(td.get_text())
    print(data)

    # or, more or less the same as above, as a one-liner
    data = [td.get_text() for row in rows for td in row.findAll("td")]

Output:

    [u'active ', u'Active', u'This account is active and may be used.', u'inactive ', u'Inactive', u'This account is inactive\n and should not be used to track financial information.']

Answer 2

JustMe has answered the question correctly. Another equivalent variant is:

    import bs4

    text = """
    <table class="codes"><tr><td><b>Code</b>
    </td><td><b>Display</b></td><td><b>Definition</b></td>
    </tr><tr><td>active<a name="active"> </a></td>
    <td>Active</td><td>This account is active and may be used.</td></tr>
    <tr><td>inactive<a name="inactive"> </a></td>
    <td>Inactive</td><td>This account is inactive
     and should not be used to track financial information.</td></tr></table>"""

    # name the parser explicitly to avoid the "no parser specified" warning
    table = bs4.BeautifulSoup(text, "html.parser")
    tr_all = table.findAll("tr")[1:]
    # critical line: loop over the rows, then over the cells of each row
    tds_all = [td.get_text() for each_tr in tr_all for td in each_tr.findAll("td")]
    # and after that, unchanged:
    table_info = [data.encode('utf-8') for data in tds_all]

    # as a check:
    print(table_info)

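The odd-looking construct in the critical line flattens what would otherwise be a list of lists. lambda z: [x for y in z for x in y] flattens a list of lists z. I renamed x, y and z to fit this specific case.

I actually arrived at it because I first had this intermediate step as the critical line:

    tds_all = [[td.get_text() for td in each_tr.findAll("td")] for each_tr in tr_all]

which produces a list of lists for tds_all:

    [[u'active ', u'Active', u'This account is active and may be used.'], [u'inactive ', u'Inactive', u'This account is inactive\n and should not be used to track financial information.']]

Flattening that requires the [x for y in z for x in y] construct. But then I thought: why not apply this construct directly in the critical line and flatten it there?

Here z is the list of bs4 objects (tr_all). The first 'for ... in ...' takes each_tr (a bs4 object) from the list tr_all. In the second 'for ... in ...', the expression each_tr.findAll("td") produces the list of all 'td' matches in that row, and the loop picks out each td in turn. At the very start of the list expression is what gets collected into the final list: the text extracted from that object (td.get_text()). The resulting list is assigned to tds_all.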
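The order of the two for clauses described above is the same as in nested loops written outer loop first. A minimal sketch with plain lists (the sample values are borrowed from the table and the variable names are only illustrative):

```python
# A list of lists, shaped like the intermediate tds_all described above.
nested = [["active ", "Active"], ["inactive ", "Inactive"]]

# Flattening comprehension: the outer loop comes first, the inner loop second.
flat = [cell for row in nested for cell in row]

# The same thing written as explicit nested loops.
flat_loop = []
for row in nested:
    for cell in row:
        flat_loop.append(cell)

print(flat)
print(flat == flat_loop)
```

Reading the comprehension left to right after the leading expression gives the loops in the order they would be nested.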

This code produces the following list:

    ['active ', 'Active', 'This account is active and may be used.', 'inactive ', 'Inactive', 'This account is inactive\n and should not be used to track financial information.']

JustMe's first example leaves out the two longer elements. I suppose, Mary, that you wanted to include them, didn't you?
