New to Python and web scraping: scraping an HTML table, but it's not displaying all columns
I'm using BeautifulSoup and trying to scrape an HTML table. I'm only interested in the first table. However, the output is missing one column of values: the "Entries" column. I'm not sure what I'm doing wrong here.
Here is my code:
import requests
from bs4 import BeautifulSoup
URL = "http://www.godaycare.com/child-care-cost/saskatchewan"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
table = soup.find_all('table')[0]
for child in soup.find_all('table')[0].children:
    for td in child:
        print(td.text)
Here is the output:
TypeAge Cat.SpotAVG. Cost ($)Entries
LicensedInfantFull-Time751.02717
LicensedInfantPart-Time41.31187
UnlicensedInfantFull-Time699.56287
UnlicensedInfantPart-Time31.0550
LicensedToddlerFull-Time661.04604
LicensedToddlerPart-Time32.69148
UnlicensedToddlerFull-Time633.01342
UnlicensedToddlerPart-Time35.9969
LicensedPreschoolFull-Time595.45327
LicensedPreschoolPart-Time30.8566
UnlicensedPreschoolFull-Time602.82195
UnlicensedPreschoolPart-Time30.3330
LicensedKindergartenFull-Time562.8787
LicensedKindergartenPart-Time28.2938
UnlicensedKindergartenFull-Time549.1257
UnlicensedKindergartenPart-Time23.0113
LicensedSchoolageFull-Time605.3494
LicensedSchoolagePart-Time25.4533
UnlicensedSchoolageFull-Time434.9098
UnlicensedSchoolagePart-Time19.0025
The easiest way to read the first table is to use pandas.read_html:
import pandas as pd
url = "http://www.godaycare.com/child-care-cost/saskatchewan"
df = pd.read_html(url)[0]
print(df.to_markdown())
Prints:
| | Type | Age Cat. | Spot | AVG. Cost ($) | Entries |
|---|---|---|---|---|---|
| 0 | Licensed | Infant | Full-Time | 751.02 | 717 |
| 1 | Licensed | Infant | Part-Time | 41.31 | 187 |
| 2 | Unlicensed | Infant | Full-Time | 699.56 | 287 |
| 3 | Unlicensed | Infant | Part-Time | 31.05 | 50 |
| 4 | Licensed | Toddler | Full-Time | 661.04 | 604 |
| 5 | Licensed | Toddler | Part-Time | 32.69 | 148 |
| 6 | Unlicensed | Toddler | Full-Time | 633.01 | 342 |
| 7 | Unlicensed | Toddler | Part-Time | 35.99 | 69 |
| 8 | Licensed | Preschool | Full-Time | 595.45 | 327 |
| 9 | Licensed | Preschool | Part-Time | 30.85 | 66 |
| 10 | Unlicensed | Preschool | Full-Time | 602.82 | 195 |
| 11 | Unlicensed | Preschool | Part-Time | 30.33 | 30 |
| 12 | Licensed | Kindergarten | Full-Time | 562.87 | 87 |
| 13 | Licensed | Kindergarten | Part-Time | 28.29 | 38 |
| 14 | Unlicensed | Kindergarten | Full-Time | 549.12 | 57 |
| 15 | Unlicensed | Kindergarten | Part-Time | 23.01 | 13 |
| 16 | Licensed | Schoolage | Full-Time | 605.34 | 94 |
| 17 | Licensed | Schoolage | Part-Time | 25.45 | 33 |
| 18 | Unlicensed | Schoolage | Full-Time | 434.9 | 98 |
| 19 | Unlicensed | Schoolage | Part-Time | 19 | 25 |
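EDIT: BeautifulSoup version:
import requests
from bs4 import BeautifulSoup
URL = "http://www.godaycare.com/child-care-cost/saskatchewan"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
# Walk the rows (<tr>) of the first table on the page.
for row in soup.find("table").find_all("tr"):
    # Collect the text of every header (<th>) and data (<td>) cell in the row.
    tds = [td.text for td in row.find_all(["td", "th"])]
    # Print each cell left-aligned in a 20-character column.
    print(("{:<20}" * len(tds)).format(*tds))
Prints: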
Type                Age Cat.            Spot                AVG. Cost ($)       Entries
Licensed            Infant              Full-Time           751.02              717
Licensed            Infant              Part-Time           41.31               187
Unlicensed          Infant              Full-Time           699.56              287
Unlicensed          Infant              Part-Time           31.05               50
Licensed            Toddler             Full-Time           661.04              604
Licensed            Toddler             Part-Time           32.69               148
Unlicensed          Toddler             Full-Time           633.01              342
Unlicensed          Toddler             Part-Time           35.99               69
Licensed            Preschool           Full-Time           595.45              327
Licensed            Preschool           Part-Time           30.85               66
Unlicensed          Preschool           Full-Time           602.82              195
Unlicensed          Preschool           Part-Time           30.33               30
Licensed            Kindergarten        Full-Time           562.87              87
Licensed            Kindergarten        Part-Time           28.29               38
Unlicensed          Kindergarten        Full-Time           549.12              57
Unlicensed          Kindergarten        Part-Time           23.01               13
Licensed            Schoolage           Full-Time           605.34              94
Licensed            Schoolage           Part-Time           25.45               33
Unlicensed          Schoolage           Full-Time           434.90              98
Unlicensed          Schoolage           Part-Time           19.00               25
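The row-by-row idea does not depend on BeautifulSoup or pandas. As a rough, dependency-free sketch, the same walk (one list per `tr`, one entry per `td`/`th`) can be written with the standard library's `html.parser`. Everything named here is made up for illustration: `SAMPLE_HTML` is a small stand-in for the real page, and `FirstTableParser` and `format_row` are hypothetical helpers, not part of any library. It also assumes simple markup with no nested tables.

```python
from html.parser import HTMLParser

# Hypothetical sample markup standing in for the real page (assumption).
SAMPLE_HTML = """
<table>
  <tr><th>Type</th><th>Age Cat.</th><th>Spot</th><th>AVG. Cost ($)</th><th>Entries</th></tr>
  <tr><td>Licensed</td><td>Infant</td><td>Full-Time</td><td>751.02</td><td>717</td></tr>
  <tr><td>Licensed</td><td>Infant</td><td>Part-Time</td><td>41.31</td><td>187</td></tr>
</table>
<table><tr><td>ignored second table</td></tr></table>
"""


class FirstTableParser(HTMLParser):
    """Collect the cell text of the first <table> as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows, each a list of cell strings
        self._tables_seen = 0   # number of <table> start tags seen so far
        self._in_table = False  # True only while inside the first table
        self._cell = None       # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._tables_seen += 1
            self._in_table = self._tables_seen == 1
        elif self._in_table and tag == "tr":
            self.rows.append([])  # start a new row
        elif self._in_table and tag in ("td", "th"):
            self._cell = []       # start a new cell

    def handle_data(self, data):
        # Keep text only while a cell is open; whitespace between tags is skipped.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "table":
            self._in_table = False
        elif tag in ("td", "th") and self._cell is not None:
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None


def format_row(cells, width=20):
    """Left-align every cell in its own fixed-width column."""
    slot = f"{{:<{width}}}"  # becomes "{:<20}" for width=20
    return (slot * len(cells)).format(*cells)


parser = FirstTableParser()
parser.feed(SAMPLE_HTML)
for row in parser.rows:
    print(format_row(row).rstrip())
```

Because each cell is stored separately before printing, the values cannot run together the way they do when the text of a whole row is printed at once. For real pages, BeautifulSoup or `pandas.read_html` will cope far better with messy markup than this sketch.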