Is there a more pythonic way of parsing my table using Beautifulsoup
I am new to Python. I have an HTML page containing a table similar to the one below, and I would like to parse and process this data in a tidier, more pythonic way.
<table border="1">
<tr><td><b>Test Results</b></td><td><b>Log File</b></td><td><b>Passes</b></td><td><b>Fails</b></td></tr>
<tr><td><b>Test suite A</b></td><td><a href="A_logs.html">Logs</a></td><td><b>10</b></td><td><b>0</b></td></tr>
<tr><td><b>Test suite B</b></td><td><a href="B_logs.html">Logs</a></td><td><b>20</b></td><td><b>0</b></td></tr>
<tr><td><b>Test suite C</b></td><td><a href="C_logs.html">Logs</a></td><td><b>15</b></td><td><b>0</b></td></tr>
</table>
Using BeautifulSoup, I parsed the table.
results_table = tables[0]  # This will get the first table on the page.
table_rows = results_table.findChildren(['th', 'tr'])
for i in table_rows:
    text = str(i)
    print("All rows:: {0}\n".format(text))
    if "Test suite A" in text:
        print("Test Suite: {0}".format(text))
        # strip out html characters
        list = str(BeautifulSoup(text).findAll(text=True))
        # strip out any further stray characters such as [,]
        list = re.sub(r"['\[\]]", "", list)
        list = list.split(',')  # split my list entries by comma
        print("Test: {0}".format(str(list[0])))
        print("Logs: {0}".format(str(list[1])))
        print("Pass: {0}".format(str(list[3])))
        print("Fail: {0}".format(str(list[4])))
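That is my code, and it does everything I want. I am just wondering whether there is a more pythonic way of doing this. Ignore the print statements, as I plan to put this into its own method, passing in the results table and returning the pass, fail, log and test values.
So..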
def parseHtml(results_table):
    # split out all rows in my table into a list
    table_rows = results_table.findChildren(['th', 'tr'])
    for i in table_rows:
        text = str(i)
        if "Test suite A" in text:
            # strip out html characters
            list = str(BeautifulSoup(text).findAll(text=True))
            # strip out any further stray characters such as [,]
            list = re.sub(r"['\[\]]", "", list)
            # split my list entries by comma
            list = list.split(',')
            return (list[0], list[1], list[3], list[4])
In cases like this, I tend to iterate over the 'tr' elements and then over the 'td' elements inside each one:
# my_table is the table's HTML markup
bs_table = BeautifulSoup(my_table, "html.parser")
ls_rows = []
for ls_tr in bs_table.findAll('tr'):
    # one list of cell texts per row
    ls_rows.append([td_bloc.text for td_bloc in ls_tr.findAll('td')])
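As a rough sketch of where the row-then-cell loop can lead, the version below turns each data row into a dict keyed by the header row's cell text. The function name `parse_results`, the choice of `html.parser`, and the decision to store the link's `href` under the "Log File" key are assumptions made for illustration, not something stated in the thread.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
html = """<table border="1">
<tr><td><b>Test Results</b></td><td><b>Log File</b></td><td><b>Passes</b></td><td><b>Fails</b></td></tr>
<tr><td><b>Test suite A</b></td><td><a href="A_logs.html">Logs</a></td><td><b>10</b></td><td><b>0</b></td></tr>
<tr><td><b>Test suite B</b></td><td><a href="B_logs.html">Logs</a></td><td><b>20</b></td><td><b>0</b></td></tr>
<tr><td><b>Test suite C</b></td><td><a href="C_logs.html">Logs</a></td><td><b>15</b></td><td><b>0</b></td></tr>
</table>"""


def parse_results(markup):
    """Return one dict per data row, keyed by the header row's cell text.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    soup = BeautifulSoup(markup, "html.parser")
    rows = soup.find_all("tr")
    # The first row holds the column titles.
    headers = [td.get_text(strip=True) for td in rows[0].find_all("td")]
    results = []
    for tr in rows[1:]:
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        record = dict(zip(headers, cells))
        # Assumption: the log file name is more useful than the link text "Logs".
        link = tr.find("a", href=True)
        if link is not None:
            record["Log File"] = link["href"]
        results.append(record)
    return results


results = parse_results(html)
for record in results:
    print(record)
```

Looking up a suite then becomes a plain filter, for example `[r for r in results if r["Test Results"] == "Test suite A"]`, with no string matching against raw HTML.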
html="""<table border="1">
<tr><td><b>Test Results</b></td><td><b>Log File</b></td><td><b>Passes</b></td><td><b>Fails</b></td></tr>
<tr><td><b>Test suite A</b></td><td><a href="A_logs.html">Logs</a></td><td><b>10</b></td><td><b>0</b></td></tr>
<tr><td><b>Test suite B</b></td><td><a href="B_logs.html">Logs</a></td><td><b>20</b></td><td><b>0</b></td></tr>
<tr><td><b>Test suite C</b></td><td><a href="C_logs.html">Logs</a></td><td><b>15</b></td><td><b>0</b></td></tr>
</table>"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
data = soup.find_all("b")  # every <b> tag in the table, header row included
# ignore Test Results etc. (the first four tags) and group the rest three per row
data = (data[4:][i:i+3] for i in range(0, len(data[4:]), 3))
# get all log file names
logs = iter(x["href"] for x in soup.find_all("a", href=True))
# unpack each subelement and print the tag text
for a, b, c in data:
    print("Test: {}, Log: {}, Pass: {}, Fail: {}".format(a.text, next(logs), b.text, c.text))
Test: Test suite A, Log: A_logs.html, Pass: 10, Fail: 0
Test: Test suite B, Log: B_logs.html, Pass: 20, Fail: 0
Test: Test suite C, Log: C_logs.html, Pass: 15, Fail: 0
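Don't use list as a variable name, because it shadows the built-in Python list. If you want elements from the sublists returned by the find_all calls, iterate over them or index into them; don't use re.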
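A minimal sketch of that advice, assuming a single row wrapped in a table and parsed with `html.parser`: the cell texts are unpacked directly from the `find_all` result, with no `re` and no variable named `list`. The names `row_html`, `cells` and `summary` are illustrative only.

```python
from bs4 import BeautifulSoup

# One row from the question's table, wrapped in <table> so it is valid markup.
row_html = (
    '<table><tr><td><b>Test suite A</b></td>'
    '<td><a href="A_logs.html">Logs</a></td>'
    '<td><b>10</b></td><td><b>0</b></td></tr></table>'
)

soup = BeautifulSoup(row_html, "html.parser")

# Index or unpack the cells instead of stringifying the row and cleaning it with re.
cells = [td.get_text(strip=True) for td in soup.find_all("td")]
test, _link_text, passes, fails = cells

# The log file name lives in the href attribute, not in the cell text.
log = soup.find("a", href=True)["href"]

summary = (test, log, int(passes), int(fails))
print(summary)
```

Because nothing is bound to the name `list`, the built-in stays usable later in the same scope, for example `list(cells)`.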