How to get text from an HTML table?

Question

I have this HTML:

<table class= "tb1">
<thead>
<tr>
<th width="100">Country,<br>Other</br></th>
<th width="20">Total<br>Customers</br></th>
<th width="30">New<br>Customers</br></th>
<th width="30">Tests/<br/>
<nobr>1M cases</nobr>
</th>
<th style="display:none" width="30">Continent</th>
</tr>
</thead>
</table>

I use XPath to get the text of each header cell:

'//table[@class="tb1"]//thead//tr//th/text()'

The result is:

['Country,', 'Other', 'Total', 'Customers', 'New', 'Customers', 'Tests/', '\n    ', '\n    ', 'Continent']

Expected result:

['Country,Other', 'TotalCustomers', 'NewCustomers', 'Tests/1M cases', 'Continent']

I tried using:

'string(//table[@class="tb1"]//thead//tr//th)'

But the result is only:

Country,Other

Answer 1

First use XPath to get every <th>, then use a for loop to get './/text()' inside each <th>. You can then clean the pieces (i.e. strip the newlines) and join them to build one string per <th>.

import lxml.html

html ='''
<table class= "tb1">
<thead>
<tr>
<th width="100">Country,<br>Other</br></th>
<th width="20">Total<br>Customers</br></th>
<th width="30">New<br>Customers</br></th>
<th width="30">Tests/<br/>
<nobr>1M cases</nobr>
</th>
<th style="display:none" width="30">Continent</th>
</tr>
</thead>
</table>
'''

soup = lxml.html.fromstring(html)

results = []

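for th in soup.xpath('//th'):
    # collect every text node under this <th>, strip whitespace, join the pieces
    text = ''.join(x.strip() for x in th.xpath('.//text()'))
    # equivalent alternative:
    #text = ''.join(x.strip() for x in th.itertext())
    results.append(text)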

print(results)
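
As a side note on the 'string(...)' attempt in the question: in XPath 1.0, string() applied to a node-set returns the string value of the first node only, which is why just 'Country,Other' came back. A minimal sketch, assuming the same HTML as above, that evaluates string(.) relative to each <th> instead (the variable names doc and headers are only illustrative):

```python
import lxml.html

html = '''
<table class="tb1">
<thead>
<tr>
<th width="100">Country,<br>Other</br></th>
<th width="20">Total<br>Customers</br></th>
<th width="30">New<br>Customers</br></th>
<th width="30">Tests/<br/>
<nobr>1M cases</nobr>
</th>
<th style="display:none" width="30">Continent</th>
</tr>
</thead>
</table>
'''

doc = lxml.html.fromstring(html)

headers = []
for th in doc.xpath('//table[@class="tb1"]//thead//tr//th'):
    # string(.) is evaluated against this single <th>, so it returns all of
    # the cell's text, including text inside <nobr> and text following <br>
    raw = th.xpath('string(.)')
    # drop the newlines and indentation that come from the markup layout
    headers.append(''.join(line.strip() for line in raw.splitlines()))

print(headers)
```

This should give the same list as the loop over './/text()'; which of the two reads better is mostly a matter of taste.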

Answer 2

I would use BeautifulSoup4:

pip install beautifulsoup4

This returns one list per row of your table, whether it is a header row or a data row:

from bs4 import BeautifulSoup


html_text = '''<table class= "tb1">
<thead>
    <tr>
        <th width="100">Country,<br>Other</br></th>
        <th width="20">Total<br>Customers</br></th>
        <th width="30">New<br>Customers</br></th>
        <th width="30">Tests/<br/>
            <nobr>1M cases</nobr>
        </th>
        <th style="display:none" width="30">Continent</th>
    </tr>
</thead>
<tbody>
    <tr>
        <td>Country1</td>
        <td>20</td>
        <td>3</td>
        <td>1</td>
        <td>Europe</td>
    </tr>
    <tr>
        <td>Country2</td>
        <td>15</td>
        <td>1</td>
        <td>3</td>
        <td>North America</td>
    </tr>
</tbody>

</table>'''

soup = BeautifulSoup(html_text, 'html.parser')

def get_table():
    table = []
    for tr in soup.find_all('tr'):
        # get headers
        th = tr.find_all('th')
        # get rows
        td = tr.find_all('td')
        # listify and combine them (just in case the html is structured weird somehow)
        row = [i.text for i in th] + [i.text for i in td]
        # append the new list to the table list
        table.append(row)
    return table

print(get_table())

Output:

[['Country,Other', 'TotalCustomers', 'NewCustomers', 'Tests/\n1M cases\n', 'Continent'], ['Country1', '20', '3', '1', 'Europe'], ['Country2', '15', '1', '3', 'North America']]
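
The 'Tests/' header above still carries newlines, which differs from the result the question asked for. One possible way to remove them, sketched here on a shortened version of the table, is BeautifulSoup's get_text(strip=True), which strips each text fragment and skips the empty ones before joining. The function name get_clean_table is hypothetical, not part of the original answer:

```python
from bs4 import BeautifulSoup

# shortened sample of the table above (an assumption, to keep the sketch small)
html_text = '''<table class="tb1">
<thead>
    <tr>
        <th width="100">Country,<br>Other</br></th>
        <th width="30">Tests/<br/>
            <nobr>1M cases</nobr>
        </th>
    </tr>
</thead>
<tbody>
    <tr>
        <td>Country1</td>
        <td>1</td>
    </tr>
</tbody>
</table>'''

soup = BeautifulSoup(html_text, 'html.parser')

def get_clean_table(soup):
    table = []
    for tr in soup.find_all('tr'):
        # take header and data cells together, in document order
        cells = tr.find_all(['th', 'td'])
        # strip=True strips every text fragment and drops whitespace-only ones
        table.append([cell.get_text(strip=True) for cell in cells])
    return table

clean_table = get_clean_table(soup)
print(clean_table)
```

Note that strip=True joins the fragments with no separator, so 'Hello <b>world</b>' would come out as 'Helloworld'; passing a separator such as get_text(' ', strip=True) may be preferable for cells with inline markup.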

You could also turn it into a list of dictionaries, with the headers as keys and the data as values, which may be easier to work with in Python.
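
A minimal sketch of that idea, assuming the first row of the table list holds the headers and every later row is data. The helper name rows_to_dicts is hypothetical, and the header strings are written here without the stray newlines for readability:

```python
# same shape as the list returned by get_table(): header row first, then data rows
table = [
    ['Country,Other', 'TotalCustomers', 'NewCustomers', 'Tests/1M cases', 'Continent'],
    ['Country1', '20', '3', '1', 'Europe'],
    ['Country2', '15', '1', '3', 'North America'],
]

def rows_to_dicts(table):
    # split off the header row; this assumes the table has at least one row
    header, *rows = table
    # pair each header with the cell in the same column
    return [dict(zip(header, row)) for row in rows]

records = rows_to_dicts(table)
print(records[0]['Continent'])
```

Because zip stops at the shorter sequence, a row with fewer cells than the header simply yields a dictionary with fewer keys rather than an error.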

python

xpath

lxml