Pandas read_html unable to read tables
I am using the following code:
import requests, pandas as pd
from bs4 import BeautifulSoup

if __name__ == '__main__':
    url = 'https://www.har.com/homedetail/6408-burgoyne-rd-157-houston-tx-77057/3380601'
    list_of_dataframes = pd.read_html(url)
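However, list_of_dataframes does not contain the school information shown at the bottom of the page at the URL above.
I would like to know how to get the following information into a DataFrame, as shown below: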
School                        Stars  Rating
BRIARGROVE Elementary School  4      Good
TANGLEWOOD Middle School      4      Good
WISDOM High School            3      Average
TIA
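You can't get the school information with pandas alone, because it isn't in a table. Those are just regular divs, and pd.read_html only picks up table elements. So you have to parse the HTML yourself and then dump the data into a pd.DataFrame.
Here's how: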
import pandas as pd
import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':
    url = 'https://www.har.com/homedetail/6408-burgoyne-rd-157-houston-tx-77057/3380601'
    # Grab the div that wraps the schools section of the page.
    soup = BeautifulSoup(requests.get(url).text, "lxml").find("div", {"id": "SCHOOLS"})
    # Each school sits in its own "border_row" div.
    schools = soup.find_all("div", class_="border_row")
    schools_data = []
    for school in schools:
        # The school name is the text of the first link in the row.
        name = school.find("a").getText()
        # Count the star images to get the numeric score.
        stars = len([i for i in school.find_all("img") if "star" in i["src"]])
        # The rating word is the second-to-last whitespace-separated token in the row text.
        rating = school.getText().split()[-2]
        schools_data.append(
            [
                name,
                stars,
                rating,
            ]
        )
    print(pd.DataFrame(schools_data, columns=["School", "Stars", "Rating"]))
Output:
                         School  Stars   Rating
0  BRIARGROVE Elementary School      4     Good
1      TANGLEWOOD Middle School      4     Good
2            WISDOM High School      3  Average
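If you want to try the same idea without hitting the site, here is a rough offline sketch. The SAMPLE_HTML string is made up for illustration (the span holding the rating is an assumption, not the real page markup), and parse_schools is a hypothetical helper name. It uses the built-in "html.parser" instead of "lxml" and skips rows that lack a link, so a layout change gives you an empty DataFrame rather than an AttributeError.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup that only mimics a div-based schools section.
SAMPLE_HTML = """
<div id="SCHOOLS">
  <div class="border_row">
    <a href="#">BRIARGROVE Elementary School</a>
    <img src="/img/star.png"><img src="/img/star.png">
    <img src="/img/star.png"><img src="/img/star.png">
    <span>Good</span>
  </div>
  <div class="border_row">
    <a href="#">WISDOM High School</a>
    <img src="/img/star.png"><img src="/img/star.png"><img src="/img/star.png">
    <span>Average</span>
  </div>
</div>
"""

COLUMNS = ["School", "Stars", "Rating"]


def parse_schools(html):
    """Turn div-based school rows into a DataFrame (empty if nothing matches)."""
    soup = BeautifulSoup(html, "html.parser")
    section = soup.find("div", {"id": "SCHOOLS"})
    if section is None:
        # The section is missing, so return an empty frame with the right columns.
        return pd.DataFrame(columns=COLUMNS)

    rows = []
    for row in section.find_all("div", class_="border_row"):
        link = row.find("a")
        if link is None:
            continue  # not a school row
        # Count only images whose src mentions "star".
        stars = sum("star" in img.get("src", "") for img in row.find_all("img"))
        label = row.find("span")  # assumed location of the rating text
        rating = label.get_text(strip=True) if label else None
        rows.append([link.get_text(strip=True), stars, rating])
    return pd.DataFrame(rows, columns=COLUMNS)


if __name__ == "__main__":
    print(parse_schools(SAMPLE_HTML))
```

For the live page you would pass requests.get(url).text to the function and adjust the selectors to whatever the real markup uses.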