Scraping a nested and unstructured table in Python (lxml)
The site I'm scraping (with lxml) works fine for everything except one table, in which all the tr, td, and th header elements are nested and mixed together, forming an unstructured HTML table.
<table class='table'>
<tr>
<th>Serial No.
<th>Full Name
<tr>
<td>1
<td rowspan='1'> John
<tr>
<td>2
<td rowspan='1'>Jane Alleman
<tr>
<td>3
<td rowspan='1'>Mukul Jha
.....
.....
.....
</table>
I tried the following XPaths, but each of them just gives me an empty list.
persons = [x for x in tree.xpath('//table[@class="table"]/tr/th/th/tr/td/td/text()')]
persons = [x for x in tree.xpath('//table[@class="table"]/tr/td/td/text()')]
persons = [x for x in tree.xpath('//table[@class="table"]/tr/th/th/tr/td/td/text()') if x.isdigit() ==False] # to remove the serial no.s
Finally, what is the reason for nesting like this? Is it meant to prevent scraping?
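It seems that lxml loads the table much as a browser does: it builds the correct structure in memory, and you can see the corrected HTML by calling lxml.html.tostring(table).
The table is therefore already well-formed once parsed, and the ordinary './tr/td//text()' is all you need to get every value.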
import requests
import lxml.html

text = requests.get('https://delhimetrorail.info/dwarka-sector-8-delhi-metro-station-to-dwarka-sector-14-delhi-metro-station').text
s = lxml.html.fromstring(text)

# second table on the page (Python index 1)
table = s.xpath('//table')[1]

# print the text of every cell, row by row
for row in table.xpath('./tr'):
    cells = row.xpath('./td//text()')
    print(cells)

# serialize the parsed table to see the structure lxml built
print(lxml.html.tostring(table, pretty_print=True).decode())
Result:
['Fare', ' DMRC Rs. 30']
['Time', '0:14']
['First', '6:03']
['Last', '22:24']
['Phone ', '8800793196']
<table class="table">
<tr>
<td title="Monday To Saturday">Fare</td>
<td><div> DMRC Rs. 30</div></td>
</tr>
<tr>
<td>Time</td>
<td>0:14</td>
</tr>
<tr>
<td>First</td>
<td>6:03</td>
</tr>
<tr>
<td>Last</td>
<td>22:24</td>
</tr>
<tr>
<td>Phone </td>
<td><a href="tel:8800793196">8800793196</a></td>
</tr>
</table>
The original HTML for comparison; note the missing closing tags:
<table class='table'>
<tr><td title='Monday To Saturday'>Fare<td><div> DMRC Rs. 30</div></tr>
<tr><td>Time<td>0:14</tr>
<tr><td>First<td>6:03</tr>
<tr><td>Last<td>22:24
<tr><td>Phone <td><a href='tel:8800793196'>8800793196</a></tr>
</table>
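To check this against the markup from the question without any network access, here is a minimal sketch that parses the question's sample table from a string. It assumes the sample is representative of the real page; the variable names (RAW, doc, headers, rows, nested, names) are made up for illustration. It also suggests why the attempted XPaths come back empty: after parsing, the td elements are siblings inside each tr, so a path such as tr/td/td has nothing to match.

```python
import lxml.html

# Sample markup from the question: no closing </th>, </td> or </tr> tags.
RAW = """
<table class='table'>
<tr>
<th>Serial No.
<th>Full Name
<tr>
<td>1
<td rowspan='1'> John
<tr>
<td>2
<td rowspan='1'>Jane Alleman
<tr>
<td>3
<td rowspan='1'>Mukul Jha
</table>
"""

# document_fromstring always returns the <html> root element.
doc = lxml.html.document_fromstring(RAW)
table = doc.xpath('//table[@class="table"]')[0]

# Header cells and data cells are direct children of their <tr>.
headers = [th.text_content().strip() for th in table.xpath('./tr/th')]
rows = [
    [td.text_content().strip() for td in tr.xpath('./td')]
    for tr in table.xpath('./tr[td]')  # only rows that contain <td> cells
]

# One of the paths from the question: there is no <td> inside a <td>.
nested = doc.xpath('//table[@class="table"]/tr/td/td/text()')

# Second cell of each row gives the names without the serial numbers.
names = [t.strip() for t in doc.xpath('//table[@class="table"]/tr/td[2]/text()')]

print(headers)
print(rows)    # [['1', 'John'], ['2', 'Jane Alleman'], ['3', 'Mukul Jha']]
print(nested)  # []
print(names)
```

The strip() calls are needed because each cell's text runs up to the next opening tag and so keeps its trailing newline.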
Similar to furas's answer, but using pandas to scrape the last table on the page:
import requests
import lxml.html  # "import lxml" alone does not load the lxml.html submodule
import pandas as pd

url = 'https://delhimetrorail.info/dwarka-sector-8-delhi-metro-station-to-dwarka-sector-14-delhi-metro-station'
response = requests.get(url)
root = lxml.html.fromstring(response.text)

rows = []
# every <td> that carries a rowspan attribute holds the description
info = root.xpath('//table[4]/tr/td[@rowspan]')
for i in info:
    row = []
    row.append(i.getprevious().text)  # the sibling <td> just before it
    row.append(i.text)
    rows.append(row)

# header texts become the DataFrame column names
columns = root.xpath('//table[4]//th/text()')
df1 = pd.DataFrame(rows, columns=columns)
df1
Output:
  Gate  Dwarka Sector 14 Metro Station
0    1                  Eros Etro Mall
1    2  Nirmal Bharatiya Public School
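The same rowspan/getprevious() idea can be tried offline on markup shaped like the table in the question. This is only a sketch under assumptions: the second row wraps the name in a link, which is a hypothetical variation added to show one caveat, and the names RAW, columns, rows and records are illustrative. It uses text_content() rather than .text, because .text holds only the text before the first child element and would be None for a cell whose content sits inside an a or div.

```python
import lxml.html

# Markup shaped like the question's table. The <a> in the second row is a
# hypothetical variation, not taken from the real page.
RAW = """<table class='table'>
<tr><th>Serial No.<th>Full Name
<tr><td>1<td rowspan='1'> John
<tr><td>2<td rowspan='1'><a href='#'>Jane Alleman</a>
</table>"""

root = lxml.html.document_fromstring(RAW)

# Header texts, stripped of the trailing newlines left by the missing end tags.
columns = [t.strip() for t in root.xpath('//table//th/text()')]

rows = []
for cell in root.xpath('//table/tr/td[@rowspan]'):
    label = cell.getprevious()  # the sibling <td> just before this cell
    # text_content() also collects text nested inside child elements
    rows.append([label.text_content().strip(), cell.text_content().strip()])

# Plain-Python equivalent of a small table: one dict per row.
records = [dict(zip(columns, row)) for row in rows]

print(columns)
print(rows)
print(records)
```

The rows and columns lists have the same shape as those passed to pd.DataFrame(rows, columns=columns) in the answer above, so they can be handed to pandas in the same way.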