Parsing one table out of several tables contained in an HTML page using Python
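I'm trying to parse a table in the HTML page at this link, but I haven't found a way to make sure I point at the right table, since the page also contains several other tables, as shown in the attached picture.
I have already tried the simpler approach of letting pandas.read_html figure it out on its own, but that only returns what is (I guess) at the top of the page and leaves out everything else.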
import pandas as pd
url='https://www.360optimi.com/app/sec/resourceType/benchmarkGraph?resourceSubTypeId=5c9316b28e202b46c92ca518&resourceId=envdecAluminumWindowProfAl&profileId=Saray2016&benchmarkToShow=co2_cml&entityId=5e4eae0f619e783ceb5d0732&indicatorId=lcaForLevels-CO2&stateIdOfProject='
tables = pd.read_html(url)
print(tables[0])
Which returns:
0 1 2
0 English Français Deutsch
1 Español Suomi Norsk
2 Nederlands Svenska Italiano
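Any idea how to point at the table of interest using the right HTML tags?
Edit:
As some of you noticed, the web page requires login credentials (sorry), so I have uploaded the HTML code here.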
I used the HTML you provided as input. If you want to run this code against a URL, just fetch the URL's HTML first and pass it in.
from io import StringIO

from bs4 import BeautifulSoup
import pandas as pd

# html_code_of_your_url is a placeholder: supply the page's HTML here as a string
Your_input_html_string = str(html_code_of_your_url)
# Name the parser explicitly so BeautifulSoup does not warn about guessing one
soup = BeautifulSoup(Your_input_html_string, "html.parser")
# The table to extract has the id "resourceBenchmarkTable", so pull out just that table's HTML
extracted_table_html = str(soup.find("table", id="resourceBenchmarkTable"))
# Convert the extracted table HTML into a list of pandas DataFrames
# (StringIO avoids the deprecation warning for literal HTML strings in recent pandas)
table_dataframe = pd.read_html(StringIO(extracted_table_html))
print(table_dataframe)
Output (only the first 5 rows shown to keep the answer short):
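As a possible alternative, pandas.read_html can filter tables by attribute itself through its attrs parameter, which may save the BeautifulSoup step. The sketch below is only an illustration: SAMPLE_HTML is made-up markup standing in for the real page, and it assumes the table of interest carries the id "resourceBenchmarkTable" as in the code above, and that a parser backend for read_html (lxml, or bs4 with html5lib) is installed.

```python
from io import StringIO

import pandas as pd

# Made-up HTML standing in for the real page (an assumption for illustration):
# a language picker table followed by the table of interest.
SAMPLE_HTML = """
<html><body>
  <table>
    <tbody>
      <tr><td>English</td><td>Deutsch</td></tr>
    </tbody>
  </table>
  <table id="resourceBenchmarkTable">
    <thead>
      <tr><th>Resource</th><th>Value</th></tr>
    </thead>
    <tbody>
      <tr><td>Profile A</td><td>1.5</td></tr>
      <tr><td>Profile B</td><td>2.5</td></tr>
    </tbody>
  </table>
</body></html>
"""

# attrs keeps only the tables whose HTML attributes match the given dict,
# so the language picker table is skipped.
tables = pd.read_html(StringIO(SAMPLE_HTML), attrs={"id": "resourceBenchmarkTable"})
benchmark_df = tables[0]
print(benchmark_df)
```

If no table matches, read_html raises a ValueError, so a mistyped id fails loudly rather than silently returning the wrong table.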
So I edited the code provided by @KarthickMohanraj to add the first step, namely reading the HTML file saved locally. The final code is as follows:
from io import StringIO

from bs4 import BeautifulSoup
import pandas as pd

# open the HTML file saved locally and read it as a string
filepath = 'Aluminium_Profiles_profiles.html'
with open(filepath, 'r', encoding='utf8', errors='ignore') as f:
    s = f.read()
# parse the HTML string with BeautifulSoup, naming the parser explicitly
soup = BeautifulSoup(s, 'html.parser')
# The table to extract has the id "resourceBenchmarkTable",
# so pull out just that table's HTML from the whole page
extracted_table_html = str(soup.find("table", id="resourceBenchmarkTable"))
# convert the extracted table HTML into a pandas DataFrame
# (StringIO avoids the deprecation warning for literal HTML strings in recent pandas)
table_df = pd.read_html(StringIO(extracted_table_html))[0]
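One thing that may be worth guarding against is the table id not being present, for example if the saved file is a login page rather than the real one. A minimal sketch of the same steps wrapped in a function follows; read_table_by_id and the sample markup are hypothetical names and data introduced here for illustration, not part of the original answer.

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup


def read_table_by_id(html: str, table_id: str) -> pd.DataFrame:
    """Return the <table> with the given id as a DataFrame (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        # Fail with a clear message instead of handing "None" to pandas
        raise ValueError(f"no <table> with id={table_id!r} found")
    return pd.read_html(StringIO(str(table)))[0]


# Made-up markup standing in for the saved page (an assumption for illustration)
sample_html = """
<table><tbody><tr><td>English</td><td>Deutsch</td></tr></tbody></table>
<table id="resourceBenchmarkTable">
  <thead><tr><th>Resource</th><th>Value</th></tr></thead>
  <tbody><tr><td>Profile A</td><td>1.5</td></tr></tbody>
</table>
"""

table_df = read_table_by_id(sample_html, "resourceBenchmarkTable")
print(table_df)
```

For the saved file, the string read from 'Aluminium_Profiles_profiles.html' would be passed in place of sample_html.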