How can I parse this HTML with BeautifulSoup4?
I want to get both the date and the status (the values under "Fecha" and "Estado"). There may be more td tags in the table.
<body link="#000000" vlink="#000000" alink="#000000" leftmargin="15" topmargin="0" marginwidth="0" marginheight="0" bgcolor="#FFFFFF">
<table cellspacing=0 cellpadding=0 border=0 style="width: 399px">
<tr>
<td valign=top align=left>
<TABLE border=0 cellPadding=0 cellSpacing=0 style="width: 403px">
<tr>
<td colSpan=2><IMG src="img/segpaqueteria_2013.jpg" ></td>
</tr>
</TABLE>
<table border="0" cellspacing="0" cellpadding="0" width=395>
<TR bgColor=#f9f4ed height=20>
<TD colspan=3 height=23 class=down>
<TABLE border=0 cellPadding=0 cellSpacing=0 width=395>
<TR bgColor=#f9e9d5 height=20>
<TD height=23 colspan = 4 class=down> <IMG height=10 src="img/bullet.gif" width=15> <font face=verdana size="1"><B>Envío Nro:</B> 4463400000000000255</font></TD>
</TR>
<TR bgColor=#f9f4ed height = 15>
<td bgColor=#f9e9d5 width = 1></td>
<TD class=texto ><font face=verdana size="1"><B> Remito Nro.:</B> </font></TD>
<td bgColor=white width = 1></td>
<TD class=texto><font face=verdana size="1"><B> Paquetes:</B> 1</font></TD>
</TR>
<TR height = 15>
<td bgColor=#f9e9d5 width = 1></td>
<TD bgColor=#f9e9d5 class=texto><font face=verdana size="1"><B> Retiro</B></font></TD>
<td bgColor=white width = 1></td>
<TD bgColor=#f9e9d5 class=texto><font face=verdana size="1"><B> Entrega</B></font></TD>
</TR>
<TR height = 15>
<td bgColor=#f9f4ed width = 1></td>
<TD bgColor=#f9f4ed class=texto><font face=verdana size="1"> AG RUSH SRL </font></td>
<td bgColor=white width = 1></td>
<TD bgColor=#f9f4ed class=texto><font face=verdana size="1"> S/D S/D</font></td>
</TR>
<TR height = 15>
<td bgColor=#f9f4ed width = 1></td>
<TD bgColor=#f9f4ed class=texto><font face=verdana size="1"> ARENAL CONCEPCION 3425 - 43</font></td>
<td bgColor=white width = 1></td>
<TD bgColor=#f9f4ed class=texto><font face=verdana size="1"> SANTIAGO 380 </font></td>
</TR>
<TR bgColor=#f9f4ed height = 15>
<td bgColor=#f9f4ed width = 1></td>
<TD bgColor=#f9f4ed class=texto><font face=verdana size="1"> Capital Federal</font></td>
<td bgColor=white width = 1></td>
<TD bgColor=#f9f4ed class=texto><font face=verdana size="1"> ROSARIO</font></td>
</TR>
<TR bgColor=#f9f4ed height = 15>
<td bgColor=#f9f4ed width = 1></td>
<TD bgColor=#f9f4ed class=texto><font face=verdana size="1"> 1427 - CAPITAL FEDERAL </font></td>
<td bgColor=white width = 1></td>
<TD bgColor=#f9f4ed class=texto><font face=verdana size="1"> 2000 - SANTA FE </font></td>
</TR>
</TABLE>
</TD>
</TR>
<TR bgColor=#f9e9d5 height=20>
<TD height=20 class=texto><font face=verdana size="1"> <B>Fecha</B></font></TD>
<td width=1 bgcolor=#ffffff height=20><SPACER type="block"></td>
<TD height=20 class=texto><font face=verdana size="1"> <B>Estado</B></font></TD>
</TR>
<TR bgColor=#f9f4ed height=20>
<TD height=20 class=texto><font face=verdana size="1">24/2/2015 </font></TD>
<td width=1 bgcolor=#ffffff height=20><SPACER type="block"></td>
<TD height=20 class=texto><font face=verdana size="1"> En Tránsito - Planta Velez Sarfield </font></TD>
</TR>
<TR bgColor=#f9e9d5 height=20>
<TD height=20 class=texto><font face=verdana size="1">24/2/2015 </font></TD>
<td width=1 bgcolor=#ffffff height=20><SPACER type="block"></td>
<TD height=20 class=texto><font face=verdana size="1"> Despachado a Sucursal de Destino - Planta Velez Sarfield </font></TD>
</TR>
<TR bgColor=#f9f4ed height=20>
<TD height=20 class=texto><font face=verdana size="1">25/2/2015 </font></TD>
<td width=1 bgcolor=#ffffff height=20><SPACER type="block"></td>
<TD height=20 class=texto><font face=verdana size="1"> En Tránsito a Suc. de Destino - ROSARIO </font></TD>
</TR>
</table>
<br>
<center><a href="#" onclick="javascript:history.back(1)">
<img src="img/ocaexprespak_volver.gif" border=0></a></center>
</td>
</tr>
</table>
</div>
</body>
Example result:
24/2/2015 - En Tránsito - Planta Velez Sarfield
24/2/2015 - Despachado a Sucursal de Destino - Planta Velez Sarfield
25/2/2015 - En Tránsito a Suc. de Destino - ROSARIO
What I have done so far:
URL = 'https://www1.oca.com.ar/OEPTrackingWeb/detalleenviore.asp?numero=4463400000000000255'
r = requests.get(URL, headers={'User-Agent': 'User-Agent'})
s = bs4.BeautifulSoup(r.text)
print s.body.table.table.next_sibling.next_sibling
I'd look for the column label, then navigate from there:
import re
header = s.find('b', text=re.compile('fecha', flags=re.I))
parent_row = header.find_parent('tr')
for row in parent_row.find_next_siblings('tr'):
    cells = row.find_all('td', class_='texto')
    date, entry = (c.get_text(strip=True) for c in cells)
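Once the header is found, the code walks back up to the nearest <tr> row and loops over all the table rows that follow it. Helpfully, the cells that hold text carry the texto class; calling Element.get_text() on those elements (with strip=True to remove the extra whitespace in those cells) gives us the information you need.
For your example URL this produces: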
>>> import requests
>>> import bs4
>>> import re
>>> URL = 'https://www1.oca.com.ar/OEPTrackingWeb/detalleenviore.asp?numero=4463400000000000255'
>>> r = requests.get(URL, headers={'User-Agent': 'User-Agent'})
>>> s = bs4.BeautifulSoup(r.text)
>>> header = s.find('b', text=re.compile('fecha', flags=re.I))
>>> parent_row = header.find_parent('tr')
>>> for row in parent_row.find_next_siblings('tr'):
...     cells = row.find_all('td', class_='texto')
...     date, entry = (c.get_text(strip=True) for c in cells)
...     print(date, entry)
...
24/2/2015 En Tránsito - Planta Velez Sarfield
24/2/2015 Despachado a Sucursal de Destino - Planta Velez Sarfield
25/2/2015 En Tránsito a Suc. de Destino - ROSARIO
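If you want to try the same approach without hitting the site, here is a minimal sketch that runs against a trimmed copy of the relevant rows. The names extract_tracking and SAMPLE_HTML are made up for this illustration, and the snippet is a simplified stand-in for the real page. It assumes a Beautiful Soup release that accepts string= (the newer name for the text= argument used above), names the built-in html.parser explicitly, and skips any row that does not have exactly two texto cells, since the question mentions there may be more td tags.

```python
import re

from bs4 import BeautifulSoup

# Simplified stand-in for the tracking table (assumption: the real page
# keeps the same header row followed by date/status rows).
SAMPLE_HTML = """
<table>
<tr><td class="texto"><font><b>Fecha</b></font></td>
    <td width="1"></td>
    <td class="texto"><font><b>Estado</b></font></td></tr>
<tr><td class="texto"><font>24/2/2015 </font></td>
    <td width="1"></td>
    <td class="texto"><font> En Tránsito - Planta Velez Sarfield </font></td></tr>
<tr><td class="texto"><font>25/2/2015 </font></td>
    <td width="1"></td>
    <td class="texto"><font> En Tránsito a Suc. de Destino - ROSARIO </font></td></tr>
</table>
"""


def extract_tracking(html):
    """Return a list of (date, status) tuples found below the 'Fecha' header."""
    soup = BeautifulSoup(html, "html.parser")
    # Locate the <b> tag whose text matches the column label, ignoring case.
    header = soup.find("b", string=re.compile("fecha", flags=re.I))
    if header is None:
        return []
    results = []
    # Walk up to the header row, then visit every later row in the same table.
    for row in header.find_parent("tr").find_next_siblings("tr"):
        cells = row.find_all("td", class_="texto")
        if len(cells) != 2:
            # Not a date/status row (spacer, footer, ...), so skip it.
            continue
        date, status = (c.get_text(strip=True) for c in cells)
        results.append((date, status))
    return results


for date, status in extract_tracking(SAMPLE_HTML):
    print(f"{date} - {status}")
```

Returning tuples rather than printing inside the loop keeps the parsing separate from the formatting, so the "date - status" layout from the question is just one way to display the rows.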