HTML tables with Python Beautiful Soup
I have an HTML table that looks like this:
<table border=0 cellspacing=1 cellpadding=2 class=form>
<tr class=form><td class=formlabel>Heating Coils in Bunker Tanks</td><td class=form>N</td></tr>
<tr class=forma><td class=formlabel>Heating Coils in Cargo Tanks</td><td class=form>U</td></tr>
<tr class=form><td class=formlabel>Manifold Type</td><td class=form>N</td></tr>
<tr class=forma><td class=formlabel>No. Holds</td><td class=form>5</td></tr>
<tr class=form><td class=formlabel>No. Centreline Hatches</td><td class=form>5</td></tr>
<tr class=forma><td class=formlabel>Lifting Gear</td><td class=form>Yes</td></tr>
<tr class=form><td class=formlabel>Gear</td><td class=form>4 Crane (30.5t SWL)</td></tr>
<tr class=forma><td class=formlabel>Alteration</td><td class=form>Unknown</td></tr>
</table>
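I am using Beautiful Soup to extract specific data from the response of a Scrapy spider:
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.body_as_unicode(), 'html.parser')
table = soup.find('table', {'class': 'form'})
# pseudocode: find Manifold Type and No. Holds
How can I do this? Note that the order of the values may change, but the form labels always stay the same. How can I search using a specific form label?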
Edit:
<tr class=forma><td class=formlabel>Fleet Manager (Operator)</td><td class=form><a href="oBasic.asp?LRNumber=9442964&Action=Display&LRCompanyNumber=40916">ESSAR SHIPPING LTD</a></td></tr>
The following-sibling search does not work for this particular case. How can I get around that?
You can find the td element by text and get the next sibling:
table.find('td', text='Manifold Type').next_sibling.text
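To cover both the reordering concern and the nested-link row from the edit, here is a minimal sketch that builds a label-to-value mapping for the whole table. It is an illustration under assumptions, not the only way to do it: the `table_to_dict` helper and the trimmed `HTML` sample are hypothetical names introduced here, `find_next_sibling("td")` is used instead of `.next_sibling` so that any whitespace between cells would not be picked up, and `get_text()` collects text from nested tags such as the `<a>`.

```python
from bs4 import BeautifulSoup

# Trimmed sample of the table from the question, plus the row from the edit.
HTML = """
<table border=0 cellspacing=1 cellpadding=2 class=form>
<tr class=form><td class=formlabel>Manifold Type</td><td class=form>N</td></tr>
<tr class=forma><td class=formlabel>No. Holds</td><td class=form>5</td></tr>
<tr class=forma><td class=formlabel>Fleet Manager (Operator)</td><td class=form><a href="oBasic.asp?LRNumber=9442964&Action=Display&LRCompanyNumber=40916">ESSAR SHIPPING LTD</a></td></tr>
</table>
"""


def table_to_dict(html):
    """Map each label cell's text to the text of the td that follows it."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "form"})
    result = {}
    for label_cell in table.find_all("td", class_="formlabel"):
        # Skips over text nodes and lands on the next <td> element.
        value_cell = label_cell.find_next_sibling("td")
        if value_cell is not None:
            # get_text() includes text inside nested tags such as <a>.
            result[label_cell.get_text(strip=True)] = value_cell.get_text(strip=True)
    return result


fields = table_to_dict(HTML)
print(fields["Manifold Type"])
print(fields["No. Holds"])
print(fields["Fleet Manager (Operator)"])
```

Because the lookup is keyed on the label text, the order of the rows does not matter.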
As a side note, why do you need BeautifulSoup in a Scrapy spider? Scrapy itself is quite capable of parsing HTML and locating elements:
response.xpath('//table[@class="form"]//td[.="Manifold Type"]/following-sibling::td/text()')
Demo from the scrapy shell:
$ scrapy shell index.html
In [1]: response.xpath('//table[@class="form"]//td[.="Manifold Type"]/following-sibling::td/text()').extract()
Out[1]: [u'N']