How to get text from an HTML table?
I have this HTML:
<table class= "tb1">
<thead>
<tr>
<th width="100">Country,<br>Other</br></th>
<th width="20">Total<br>Customers</br></th>
<th width="30">New<br>Customers</br></th>
<th width="30">Tests/<br/>
<nobr>1M cases</nobr>
</th>
<th style="display:none" width="30">Continent</th>
</tr>
</thead>
</table>
I use this XPath to get the text of each header cell:
'//table[@class="tb1"]//thead//tr//th/text()'
The result is:
['Country,', 'Other', 'Total', 'Customers', 'New', 'Customers', 'Tests/', '\n ', '\n ', 'Continent']
Desired result:
['Country,Other', 'TotalCustomers', 'NewCustomers', 'Tests/1M cases', 'Continent']
I tried:
'string(//table[@class="tb1"]//thead//tr//th)'
But the result is only:
Country,Other
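First use XPath to get each <th>, then use a for loop to get './/text()' inside each <th>. That returns every text node under the cell, including the one nested in <nobr>, whereas 'th/text()' only returns direct child text nodes and 'string(...)' applied to a node-set only uses its first node.
Then you can clean the pieces (i.e. remove newlines) and join them to create a single string for each <th>: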
import lxml.html
html ='''
<table class= "tb1">
<thead>
<tr>
<th width="100">Country,<br>Other</br></th>
<th width="20">Total<br>Customers</br></th>
<th width="30">New<br>Customers</br></th>
<th width="30">Tests/<br/>
<nobr>1M cases</nobr>
</th>
<th style="display:none" width="30">Continent</th>
</tr>
</thead>
</table>
'''
tree = lxml.html.fromstring(html)
results = []
for th in tree.xpath('//th'):
    # join the stripped text nodes found anywhere inside this <th>
    text = ''.join(x.strip() for x in th.xpath('.//text()'))
    # alternative without XPath:
    #text = ''.join(x.strip() for x in th.itertext())
    results.append(text)
print(results)
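If lxml is not available, roughly the same idea (collect every text piece inside a <th>, strip it, join the pieces) can be sketched with the standard library's html.parser. This is only an illustration under some assumptions: HeaderTextParser is a made-up name, not part of any library, and it assumes <th> elements are never nested.

```python
from html.parser import HTMLParser


class HeaderTextParser(HTMLParser):
    """Hypothetical helper: collects one joined string per <th> element."""

    def __init__(self):
        super().__init__()
        self.headers = []    # finished strings, one per <th>
        self._parts = None   # text pieces of the <th> being read, else None

    def handle_starttag(self, tag, attrs):
        if tag == "th":
            self._parts = []

    def handle_endtag(self, tag):
        if tag == "th" and self._parts is not None:
            self.headers.append("".join(self._parts))
            self._parts = None

    def handle_data(self, data):
        # Only keep text that appears inside a <th>; strip newlines and spaces.
        if self._parts is not None:
            self._parts.append(data.strip())


html = '''
<table class= "tb1">
<thead>
<tr>
<th width="100">Country,<br>Other</br></th>
<th width="20">Total<br>Customers</br></th>
<th width="30">New<br>Customers</br></th>
<th width="30">Tests/<br/>
<nobr>1M cases</nobr>
</th>
<th style="display:none" width="30">Continent</th>
</tr>
</thead>
</table>
'''

parser = HeaderTextParser()
parser.feed(html)
parser.close()
print(parser.headers)
```

Like the lxml version, this strips each piece before joining, so spaces at the edges of a piece are dropped while spaces inside it (as in '1M cases') are kept.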
I would use BeautifulSoup4:
pip install beautifulsoup4
This returns a list for every row of your table, whether the row holds headers or data:
from bs4 import BeautifulSoup
html_text = '''<table class= "tb1">
<thead>
<tr>
<th width="100">Country,<br>Other</br></th>
<th width="20">Total<br>Customers</br></th>
<th width="30">New<br>Customers</br></th>
<th width="30">Tests/<br/>
<nobr>1M cases</nobr>
</th>
<th style="display:none" width="30">Continent</th>
</tr>
</thead>
<tbody>
<tr>
<td>Country1</td>
<td>20</td>
<td>3</td>
<td>1</td>
<td>Europe</td>
</tr>
<tr>
<td>Country2</td>
<td>15</td>
<td>1</td>
<td>3</td>
<td>North America</td>
</tr>
</tbody>
</table>'''
soup = BeautifulSoup(html_text, 'html.parser')

def get_table():
    table = []
    for tr in soup.find_all('tr'):
        # get header cells
        th = tr.find_all('th')
        # get data cells
        td = tr.find_all('td')
        # listify and combine them (just in case the html is structured weird somehow)
        row = [i.text for i in th] + [i.text for i in td]
        # append the new list to the table list
        table.append(row)
    return table

print(get_table())
Output:
[['Country,Other', 'TotalCustomers', 'NewCustomers', 'Tests/\n1M cases\n', 'Continent'], ['Country1', '20', '3', '1', 'Europe'], ['Country2', '15', '1', '3', 'North America']]
You could also turn this into a list of dictionaries, with the headers as keys and the data as values, which may be easier to work with in Python.
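A minimal sketch of that list-of-dictionaries idea, assuming the first row returned by get_table() is the header row. The rows are pasted in from the output above so the snippet runs on its own, and rows_to_dicts is a hypothetical helper name, not part of BeautifulSoup.

```python
# Rows as printed by get_table() above; the first row holds the headers.
table = [
    ['Country,Other', 'TotalCustomers', 'NewCustomers', 'Tests/\n1M cases\n', 'Continent'],
    ['Country1', '20', '3', '1', 'Europe'],
    ['Country2', '15', '1', '3', 'North America'],
]


def rows_to_dicts(table):
    """Hypothetical helper: pair every data row with the header row."""
    if not table:
        return []
    # Drop the newlines left inside header cells such as 'Tests/\n1M cases\n'.
    headers = [h.replace('\n', '') for h in table[0]]
    return [dict(zip(headers, row)) for row in table[1:]]


records = rows_to_dicts(table)
for record in records:
    print(record['Country,Other'], record['Continent'])
```

Note that zip stops at the shorter sequence, so a row with fewer cells than there are headers would simply lack the trailing keys.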