How to skip the first table, and skip the second table's header, when parsing a local HTML file in Python?
I am trying to parse a local HTML file, and I don't understand why the same code produces different results on the sample HTML text and on the whole HTML file. Can anyone help? I would really appreciate it.
Sample HTML text:
s = '''
<table width=90%>
<tr>
<td align="center" width=18%></td>
<td align="left" width=15%></td>
</tr>
</table>
<table border>
<tr>
<td nowrap="nowrap"><b>Rec</b></td>
<td align="RIGHT" nowrap="nowrap"><b>ID</b></td>
</tr>
<tr>
<td align="RIGHT" nowrap="nowrap" VALIGN=TOP>1</td>
<td nowrap="nowrap" VALIGN=TOP><a href="smthing?DID=ID">ID<br />100100</a></td>
</tr>
</table>
<p>
<style type="text/css">
.....
</style>
<table border>
<tr>
<td nowrap="nowrap"><b>Rec</b></td>
<td align="RIGHT" nowrap="nowrap"><b>ID</b></td>
</tr>
<tr>
<td align="RIGHT" nowrap="nowrap" VALIGN=TOP>2</td>
<td nowrap="nowrap" VALIGN=TOP><a href="smthing?DID=ID">ID<br />101101</a></td>
</tr>
</table>
'''
I tried the following:
''''
from bs4 import BeautifulSoup
# with open('myfile.html', 'r', encoding='utf-8') as f:  # when using the whole file
#     s = f.read()                                        # when using the whole file
soup = BeautifulSoup(s, "html.parser")
# Nested lists: tables -> rows -> cell texts
tables = [
    [
        [td.get_text(strip=True) for td in tr.find_all('td')]
        for tr in table.find_all('tr')
    ]
    for table in soup.find_all('table')
]
# Flat list of the text of every <td> in the document
table_data = [i.text for i in soup.find_all('td')]
print(table_data)
''''
Expected output:
Rec ID
1 ID100100
2 ID101101
Current output:
['', '', 'Rec', 'ID', '1', 'ID100100', 'Rec', 'ID', '2', 'ID101101']
Also, when I run the same code on the whole HTML file, the result contains items like the following. Am I missing something here?
'', '</tr>', '', '</table>', '', '</table>', '', '</center>', '', '<hr />', '', '<center>', '',
You can pick the values out of the list by index (negative indices count from the end):
from bs4 import BeautifulSoup
s = '''
<table width=90%>
<tr>
<td align="center" width=18%></td>
<td align="left" width=15%></td>
</tr>
</table>
<table border>
<tr>
<td nowrap="nowrap"><b>Rec</b></td>
<td align="RIGHT" nowrap="nowrap"><b>ID</b></td>
</tr>
<tr>
<td align="RIGHT" nowrap="nowrap" VALIGN=TOP>1</td>
<td nowrap="nowrap" VALIGN=TOP><a href="smthing?DID=ID">ID<br />100100</a></td>
</tr>
</table>
<p>
<style type="text/css">
.....
</style>
<table border>
<tr>
<td nowrap="nowrap"><b>Rec</b></td>
<td align="RIGHT" nowrap="nowrap"><b>ID</b></td>
</tr>
<tr>
<td align="RIGHT" nowrap="nowrap" VALIGN=TOP>2</td>
<td nowrap="nowrap" VALIGN=TOP><a href="smthing?DID=ID">ID<br />101101</a></td>
</tr>
</table>
'''
soup = BeautifulSoup(s, "html.parser")
# Text of every <td> in the document, in order (all three tables).
table_data = [i.text for i in soup.find_all('td')]
# Negative indices count from the end of the list.
rec = table_data[-3]    # 'ID' (header cell of the last table)
num_1 = table_data[-5]  # 'ID100100'
num_2 = table_data[-1]  # 'ID101101'
data = [[rec, num_1, num_2]]
print(data)
Output:
[['ID', 'ID100100', 'ID101101']]
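Indexing into the flat list returns three cells rather than the rows shown in the expected output. As a minimal sketch (one possible approach, not the only one), the rows could instead be collected table by table. It assumes the first table is layout-only and that every later table starts with the same header row; `extract_rows` and `SAMPLE_HTML` are names made up for this illustration, and the HTML is a shortened copy of the sample from the question.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample from the question (same structure).
SAMPLE_HTML = '''
<table width=90%>
<tr><td align="center"></td><td align="left"></td></tr>
</table>
<table border>
<tr><td><b>Rec</b></td><td><b>ID</b></td></tr>
<tr><td>1</td><td><a href="smthing?DID=ID">ID<br />100100</a></td></tr>
</table>
<p>
<table border>
<tr><td><b>Rec</b></td><td><b>ID</b></td></tr>
<tr><td>2</td><td><a href="smthing?DID=ID">ID<br />101101</a></td></tr>
</table>
'''


def extract_rows(html):
    """Hypothetical helper: skip the first table, keep one header row."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # [1:] drops the first (layout) table.
    for index, table in enumerate(soup.find_all("table")[1:]):
        table_rows = [
            [td.get_text(strip=True) for td in tr.find_all("td")]
            for tr in table.find_all("tr")
        ]
        # Assumption: row 0 of every data table is the header; keep it once.
        start = 0 if index == 0 else 1
        rows.extend(table_rows[start:])
    return rows


for row in extract_rows(SAMPLE_HTML):
    print(" ".join(row))
```

On the sample this prints the three lines from the expected output. If the whole file still yields strings such as `'</tr>'`, the markup may be malformed in a way `html.parser` does not recover from; passing `"lxml"` or `"html5lib"` as the parser (each installed separately) might be worth trying, though that is a guess without seeing the file.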