How do I extract data from multiple tables using BeautifulSoup?
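I have 28 tables of data like the ones below. I extracted the HTML using BeautifulSoup. I need to create a CSV file by stripping the data out of the cells in these tables. I am new to Python. I tried using BeautifulSoup but could not get it to work. How do I iterate through the tables to create a CSV?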
<table border="1" cellpadding="2" cellspacing="0" width="600">
<tr>
<th colspan="3">
Chatsworth, Ga
</th>
<th colspan="6">
Forecast NFDRS-88 Valid at 1300 EST Jan 26 2015
</th>
</tr>
<tr>
<td bgcolor="#C0C0C0" width="50">
<b>
<font size="2">
RH (%)
</font>
</b>
</td>
<td bgcolor="#C0C0C0" width="60">
<b>
<font size="2">
IC
</font>
</b>
</td>
<td bgcolor="#C0C0C0" width="60">
<b>
<font size="2">
BI
</font>
</b>
</td>
<td bgcolor="#C0C0C0" width="100">
<b>
<font size="2">
Class Day
</font>
</b>
</td>
<td bgcolor="#C0C0C0" width="55">
<b>
<font size="2">
KBDI
</font>
</b>
</td>
<td bgcolor="#C0C0C0" width="80">
<b>
<font size="2">
Wind (mph)
</font>
</b>
</td>
<td bgcolor="#C0C0C0" width="80">
<b>
<font size="2">
Mx_Wind
<br/>
(mph)
</font>
</b>
</td>
<td bgcolor="#C0C0C0" width="40">
<b>
<font size="2">
Rn24
<br/>
(inch)
</font>
</b>
</td>
<td bgcolor="#C0C0C0" width="50">
<b>
<font size="2">
Dur
<br/>
(Hr)
</font>
</b>
</td>
</tr>
<tr>
<td align="center">
<font size="2">
78
</font>
</td>
<td align="center">
<font size="2">
0
</font>
</td>
<td align="center">
<font size="2">
12
</font>
</td>
<td align="center">
<font size="2">
1
Low
</font>
</td>
<td align="center">
<font size="2">
2
</font>
</td>
<td align="center">
<font size="2">
NW 10
</font>
</td>
<td align="center">
<font size="2">
NW 17
</font>
</td>
<td align="center">
<font size="2">
0.05
</font>
</td>
<td align="center">
<font size="2">
0
</font>
</td>
</tr>
<tr>
<td bgcolor="#C0C0C0">
<b>
<font size="2">
Sow
</font>
</b>
</td>
<td bgcolor="#C0C0C0">
<b>
<font size="2">
Temp (°F)
</font>
</b>
</td>
<td bgcolor="#C0C0C0">
<b>
<font size="2">
Td (°F)
</font>
</b>
</td>
<td bgcolor="#C0C0C0">
<b>
<font size="2">
Tmax (°F)
</font>
</b>
</td>
<td bgcolor="#C0C0C0">
<b>
<font size="2">
Tmin (°F)
</font>
</b>
</td>
<td bgcolor="#C0C0C0">
<b>
<font size="2">
RHMax (%)
</font>
</b>
</td>
<td bgcolor="#C0C0C0" colspan="1">
<b>
<font size="2">
RHMin (%)
</font>
</b>
</td>
<td bgcolor="#C0C0C0" colspan="1">
<b>
<font size="2">
HrbGF
</font>
</b>
</td>
<td bgcolor="#C0C0C0" colspan="1">
<b>
<font size="2">
WdyGF
</font>
</b>
</td>
</tr>
<tr>
<td align="center">
<font size="2">
4
</font>
</td>
<td align="center">
<font size="2">
41
</font>
</td>
<td align="center">
<font size="2">
34
</font>
</td>
<td align="center">
<font size="2">
41
</font>
</td>
<td align="center">
<font size="2">
38
</font>
</td>
<td align="center">
<font size="2">
86
</font>
</td>
<td align="center" colspan="1">
<font size="2">
70
</font>
</td>
<td align="center" colspan="1">
<font size="2">
0
</font>
</td>
<td align="center" colspan="1">
<font size="2">
0
</font>
</td>
</tr>
<tr>
<td bgcolor="#C0C0C0">
<b>
<font size="2">
1-Hour
</font>
</b>
</td>
<td bgcolor="#C0C0C0">
<b>
<font size="2">
10-Hour
</font>
</b>
</td>
<td bgcolor="#C0C0C0">
<b>
<font size="2">
100-Hour
</font>
</b>
</td>
<td bgcolor="#C0C0C0">
<b>
<font size="2">
1000-Hour
</font>
</b>
</td>
<td bgcolor="#C0C0C0">
<b>
<font size="2">
X1000
</font>
</b>
</td>
<td bgcolor="#C0C0C0" colspan="1">
<b>
<font size="2">
Herbaceous
</font>
</b>
</td>
<td bgcolor="#C0C0C0" colspan="1">
<b>
<font size="2">
Woody
</font>
</b>
</td>
<td bgcolor="#C0C0C0">
<b>
<font size="2">
SC
</font>
</b>
</td>
<td bgcolor="#C0C0C0">
<b>
<font size="2">
EC
</font>
</b>
</td>
</tr>
<tr>
<td align="center">
<font size="2">
20.4
</font>
</td>
<td align="center">
<font size="2">
20.4
</font>
</td>
<td align="center">
<font size="2">
21.0
</font>
</td>
<td align="center">
<font size="2">
24.7
</font>
</td>
<td align="center">
<font size="2">
24.7
</font>
</td>
<td align="center" colspan="1">
<font size="2">
20.4
</font>
</td>
<td align="center" colspan="1">
<font size="2">
70.0
</font>
</td>
<td align="center">
<font size="2">
7
</font>
</td>
<td align="center">
<font size="2">
3
</font>
</td>
</tr>
</table>
<p>
<a name="#Dallas">
</a>
</p>
<table border="1" cellpadding="2" cellspacing="0" width="600">
<tr>
<th colspan="3">
Dallas, Ga
</th>
<th colspan="6">
Forecast NFDRS-88 Valid at 1300 EST Jan 26 2015
</th>
</tr>
<tr>
<td bgcolor="#C0C0C0" width="50">
<b>
<font size="2">
RH (%)
</font>
</b>
</td>
<td bgcolor="#C0C0C0" width="60">
<b>
<font size="2">
IC
</font>
</b>
</td>
<td bgcolor="#C0C0C0" width="60">
<b>
<font size="2">
BI
</font>
</b>
</td>
<td bgcolor="#C0C0C0" width="100">
<b>
<font size="2">
Class Day
</font>
</b>
</td>
<td bgcolor="#C0C0C0" width="55">
<b>
<font size="2">
KBDI
</font>
</b>
</td>
<td bgcolor="#C0C0C0" width="80">
<b>
<font size="2">
Wind (mph)
</font>
</b>
</td>
<td bgcolor="#C0C0C0" width="80">
<b>
<font size="2">
Mx_Wind
<br/>
(mph)
</font>
</b>
</td>
<td bgcolor="#C0C0C0" width="40">
<b>
<font size="2">
Rn24
<br/>
(inch)
</font>
</b>
</td>
<td bgcolor="#C0C0C0" width="50">
<b>
<font size="2">
Dur
<br/>
(Hr)
</font>
</b>
</td>
</tr>
<tr>
<td align="center">
<font size="2">
57
</font>
</td>
<td align="center">
<font size="2">
3
</font>
</td>
<td align="center">
<font size="2">
17
</font>
</td>
<td align="center">
<font size="2">
2
Moderate
</font>
</td>
<td align="center">
<font size="2">
3
</font>
</td>
<td align="center">
<font size="2">
N 7
</font>
</td>
<td align="center">
<font size="2">
N 12
</font>
</td>
<td align="center">
<font size="2">
0.00
</font>
</td>
<td align="center">
<font size="2">
0
</font>
</td>
</tr>
<tr>
<td bgcolor="#C0C0C0">
<b>
<font size="2">
Sow
</font>
</b>
</td>
<td bgcolor="#C0C0C0">
<b>
<font size="2">
Temp (°F)
</font>
</b>
</td>
<td bgcolor="#C0C0C0">
<b>
<font size="2">
Td (°F)
</font>
</b>
</td>
<td bgcolor="#C0C0C0">
<b>
<font size="2">
Tmax (°F)
</font>
</b>
</td>
<td bgcolor="#C0C0C0">
<b>
<font size="2">
Tmin (°F)
</font>
</b>
</td>
<td bgcolor="#C0C0C0">
<b>
<font size="2">
RHMax (%)
</font>
</b>
</td>
<td bgcolor="#C0C0C0" colspan="1">
<b>
<font size="2">
RHMin (%)
</font>
</b>
</td>
<td bgcolor="#C0C0C0" colspan="1">
<b>
<font size="2">
HrbGF
</font>
</b>
</td>
<td bgcolor="#C0C0C0" colspan="1">
<b>
<font size="2">
WdyGF
</font>
</b>
</td>
</tr>
<tr>
<td align="center">
<font size="2">
4
</font>
</td>
<td align="center">
<font size="2">
45
</font>
</td>
<td align="center">
<font size="2">
30
</font>
</td>
<td align="center">
<font size="2">
46
</font>
</td>
<td align="center">
<font size="2">
38
</font>
</td>
<td align="center">
<font size="2">
82
</font>
</td>
<td align="center" colspan="1">
<font size="2">
57
</font>
</td>
<td align="center" colspan="1">
<font size="2">
0
</font>
</td>
<td align="center" colspan="1">
<font size="2">
0
</font>
</td>
</tr>
<tr>
<td bgcolor="#C0C0C0">
<b>
<font size="2">
1-Hour
</font>
</b>
</td>
<td bgcolor="#C0C0C0">
<b>
<font size="2">
10-Hour
</font>
</b>
</td>
<td bgcolor="#C0C0C0">
<b>
<font size="2">
100-Hour
</font>
</b>
</td>
<td bgcolor="#C0C0C0">
<b>
<font size="2">
1000-Hour
</font>
</b>
</td>
<td bgcolor="#C0C0C0">
<b>
<font size="2">
X1000
</font>
</b>
</td>
<td bgcolor="#C0C0C0" colspan="1">
<b>
<font size="2">
Herbaceous
</font>
</b>
</td>
<td bgcolor="#C0C0C0" colspan="1">
<b>
<font size="2">
Woody
</font>
</b>
</td>
<td bgcolor="#C0C0C0">
<b>
<font size="2">
SC
</font>
</b>
</td>
<td bgcolor="#C0C0C0">
<b>
<font size="2">
EC
</font>
</b>
</td>
</tr>
<tr>
<td align="center">
<font size="2">
13.6
</font>
</td>
<td align="center">
<font size="2">
13.6
</font>
</td>
<td align="center">
<font size="2">
18.0
</font>
</td>
<td align="center">
<font size="2">
20.2
</font>
</td>
<td align="center">
<font size="2">
20.2
</font>
</td>
<td align="center" colspan="1">
<font size="2">
13.6
</font>
</td>
<td align="center" colspan="1">
<font size="2">
70.0
</font>
</td>
<td align="center">
<font size="2">
4
</font>
</td>
<td align="center">
<font size="2">
10
</font>
</td>
</tr>
</table>
from bs4 import BeautifulSoup  # BeautifulSoup 4; the old `BeautifulSoup` module is version 3

html = '''
PASTE YOUR HTML HERE
'''

bs = BeautifulSoup(html, 'html.parser')
csv = ''
for table in bs.find_all('table'):
    for row in table.find_all('tr'):
        # Header cells (<th>) and data cells (<td>), in document order
        for cell in row.find_all(['th', 'td']):
            # Collapse the line breaks and extra whitespace inside the cell
            text = ' '.join(cell.get_text().split())
            # One comma per spanned column, so colspan=N fills N fields
            span = int(cell.get('colspan', 1))
            csv += '"' + text + '"' + ',' * span
        csv += '\n'

with open('test.csv', 'w', encoding='utf-8') as f:
    f.write(csv)
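Building the CSV by string concatenation does not escape a double quote that appears inside a cell. A minimal sketch of the same loop using the standard library's `csv.writer`, which handles quoting; the function names `tables_to_rows` and `write_csv` are illustrative assumptions, not part of the answer above:

```python
import csv
from bs4 import BeautifulSoup


def tables_to_rows(html):
    """Return one list of cell strings per <tr>, across every <table>."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for table in soup.find_all("table"):
        for tr in table.find_all("tr"):
            fields = []
            for cell in tr.find_all(["th", "td"]):
                # Collapse line breaks and runs of whitespace in the cell
                fields.append(" ".join(cell.get_text().split()))
                # Pad with empty fields so colspan=N occupies N columns
                span = int(cell.get("colspan", 1))
                fields.extend([""] * (span - 1))
            if fields:
                rows.append(fields)
    return rows


def write_csv(rows, path):
    """Write the rows to `path` (hypothetical helper)."""
    # newline="" lets the csv module control line endings
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)
```

Calling `write_csv(tables_to_rows(html), "test.csv")` would then produce one CSV line per table row.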
You can loop over the tables, and over the tags inside each table, like this:
from bs4 import BeautifulSoup

html = '''PASTE YOUR HTML HERE'''
soup = BeautifulSoup(html, 'html.parser')
for tbl in soup.find_all('table'):
    for td in tbl.find_all('td'):
        # do things with td
        print(td.text.strip())
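If one CSV line per table is more useful than one line per table row, the nested loop can be extended to pair each label row with the value row beneath it. This is only a sketch: it assumes every table has the layout shown in the question (a two-cell `<th>` row, then alternating label and value rows), and the column names `Station` and `Forecast` as well as the function names are made up for illustration.

```python
import csv
import io
from bs4 import BeautifulSoup


def cell_texts(tr):
    """Whitespace-normalized text of each <th>/<td> in a row."""
    return [" ".join(c.get_text().split()) for c in tr.find_all(["th", "td"])]


def table_to_record(table):
    """Flatten one table into a dict, assuming label rows alternate with value rows."""
    rows = [cell_texts(tr) for tr in table.find_all("tr")]
    # First row: station name and forecast title (assumed layout)
    record = {"Station": rows[0][0], "Forecast": rows[0][1]}
    body = rows[1:]
    # Pair rows 1+2, 3+4, ... as (labels, values)
    for labels, values in zip(body[0::2], body[1::2]):
        record.update(zip(labels, values))
    return record


def records_to_csv(records):
    """Serialize the records as CSV text, one line per table."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(records[0]))
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()
```

A call such as `records_to_csv([table_to_record(t) for t in soup.find_all('table')])` would return the text to write to a file. `csv.DictWriter` raises `ValueError` if a later table has labels the first one lacks, which makes a layout mismatch visible instead of silently misaligning columns.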