Scraping a nested and unstructured table in Python (lxml)

Scraping this site with lxml works fine for everything except one table, in which all the tr, td and header th elements are nested and mixed together, forming an unstructured HTML table.

<table class='table'>
    <tr>
        <th>Serial No.
            <th>Full Name
                <tr>
                    <td>1
                        <td rowspan='1'> John 
                            <tr>
                                <td>2
                                    <td rowspan='1'>Jane Alleman
                                        <tr>
                                            <td>3
                                                <td rowspan='1'>Mukul Jha
                                                 .....
                                                 .....
                                                 .....
</table>

I tried the following XPaths, but each one just gives me an empty list.

persons = [x for x in tree.xpath('//table[@class="table"]/tr/th/th/tr/td/td/text()')]

persons = [x for x in tree.xpath('//table[@class="table"]/tr/td/td/text()')]

persons = [x for x in tree.xpath('//table[@class="table"]/tr/th/th/tr/td/td/text()') if x.isdigit() ==False] # to remove the serial no.s

Finally, what is the reason for nesting like this? Is it meant to prevent scraping?

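It seems lxml loads the table much as a browser does: it builds the correct structure in memory, and you can see the corrected HTML when you use lxml.html.tostring(table).

So the table is already well formed after parsing, and a normal './tr/td//text()' is enough to get all the values.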

import requests
import lxml.html

text = requests.get('https://delhimetrorail.info/dwarka-sector-8-delhi-metro-station-to-dwarka-sector-14-delhi-metro-station').text

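s = lxml.html.fromstring(text)

# the second <table> on the page
table = s.xpath('//table')[1]

# lxml has already repaired the structure, so every row is a direct child of the table
for row in table.xpath('./tr'):
    cells = row.xpath('./td//text()')
    print(cells)

# dump the repaired HTML as lxml holds it in memory
print(lxml.html.tostring(table, pretty_print=True).decode())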

Result

['Fare', ' DMRC Rs. 30']
['Time', '0:14']
['First', '6:03']
['Last', '22:24']
['Phone ', '8800793196']

<table class="table">
<tr>
<td title="Monday To Saturday">Fare</td>
<td><div> DMRC Rs. 30</div></td>
</tr>
<tr>
<td>Time</td>
<td>0:14</td>
</tr>
<tr>
<td>First</td>
<td>6:03</td>
</tr>
<tr>
<td>Last</td>
<td>22:24</td>
</tr>
<tr>
<td>Phone </td>
<td><a href="tel:8800793196">8800793196</a></td>
</tr>
</table>

Original HTML for comparison (closing tags are missing)

<table class='table'>
<tr><td  title='Monday To Saturday'>Fare<td><div> DMRC Rs. 30</div></tr>
<tr><td>Time<td>0:14</tr>
<tr><td>First<td>6:03</tr>
<tr><td>Last<td>22:24
<tr><td>Phone <td><a href='tel:8800793196'>8800793196</a></tr>
</table>
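
The same repair can be checked offline, without fetching the page. The snippet below is only a sketch: RAW_HTML is a hand-written sample modelled on the table in the question (an assumption, not the real page source), and table_rows is a hypothetical helper name. It parses markup that has no closing th, td or tr tags and reads the cells row by row.

```python
import lxml.html

# Hypothetical sample modelled on the table in the question:
# there are no closing </th>, </td> or </tr> tags.
RAW_HTML = """
<html><body>
<table class='table'>
<tr><th>Serial No.<th>Full Name
<tr><td>1<td rowspan='1'> John
<tr><td>2<td rowspan='1'>Jane Alleman
<tr><td>3<td rowspan='1'>Mukul Jha
</table>
</body></html>
"""


def table_rows(html_text):
    """Return each row of the first table with class 'table' as a list of stripped cell texts."""
    doc = lxml.html.fromstring(html_text)
    table = doc.xpath('//table[@class="table"]')[0]
    rows = []
    # './/tr' rather than './tr' so the loop still works if a parser adds <tbody>.
    for tr in table.xpath('.//tr'):
        cells = [cell.text_content().strip() for cell in tr.xpath('./th | ./td')]
        rows.append(cells)
    return rows


if __name__ == '__main__':
    for row in table_rows(RAW_HTML):
        print(row)
```

If the parser had kept the cells nested inside one another, the first header cell would contain the text of the whole table, so a clean one-cell-per-value result suggests the structure was repaired during parsing.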

Similar to furas's answer, but using pandas for the last table on the page:

import requests
import lxml.html
import pandas as pd

url = 'https://delhimetrorail.info/dwarka-sector-8-delhi-metro-station-to-dwarka-sector-14-delhi-metro-station'
response = requests.get(url)

root = lxml.html.fromstring(response.text)
rows = []
# each td carrying a rowspan attribute is paired with the td just before it
info = root.xpath('//table[4]/tr/td[@rowspan]')
for i in info:
    rows.append([i.getprevious().text, i.text])

columns = root.xpath('//table[4]//th/text()')
df1 = pd.DataFrame(rows, columns=columns)
df1

Output:

   Gate Dwarka Sector 14 Metro Station
0   1   Eros Etro Mall
1   2   Nirmal Bharatiya Public School
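
As a possible shortcut, pandas can also parse an HTML table directly with pd.read_html, which uses lxml when it is installed. This is only a sketch under assumptions: SAMPLE_HTML is a hand-written sample modelled on the table in the question rather than the live page, and read_sample_table is a hypothetical helper name.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample modelled on the table in the question (not the live page).
SAMPLE_HTML = """
<html><body>
<table class='table'>
<tr><th>Serial No.<th>Full Name
<tr><td>1<td rowspan='1'> John
<tr><td>2<td rowspan='1'>Jane Alleman
<tr><td>3<td rowspan='1'>Mukul Jha
</table>
</body></html>
"""


def read_sample_table(html_text):
    """Parse the first table with class 'table' into a DataFrame."""
    # read_html returns a list with one DataFrame per matching table.
    tables = pd.read_html(StringIO(html_text), attrs={'class': 'table'})
    return tables[0]


if __name__ == '__main__':
    print(read_sample_table(SAMPLE_HTML))
```

Here the leading row made up only of th cells is expected to become the column header. On a real page it is worth checking the returned list, since the attrs filter may match more than one table.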