Parsing one table out of several tables contained in an HTML page using Python

Question

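I am trying to parse a table from the HTML page at this link, but I have not found a way to make sure I point at the correct table, because the page also contains several other tables, as shown in the attached picture.

I have tried the simpler approach of letting pandas.read_html figure it out by itself, but this only returns what I guess is the content at the top of the page and leaves out everything else.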

import pandas as pd
url='https://www.360optimi.com/app/sec/resourceType/benchmarkGraph?resourceSubTypeId=5c9316b28e202b46c92ca518&resourceId=envdecAluminumWindowProfAl&profileId=Saray2016&benchmarkToShow=co2_cml&entityId=5e4eae0f619e783ceb5d0732&indicatorId=lcaForLevels-CO2&stateIdOfProject='
tables = pd.read_html(url)
print(tables[0])

Which returns:

            0         1         2
0     English  Français   Deutsch
1     Español     Suomi     Norsk
2  Nederlands   Svenska  Italiano

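Any idea how to point at the table of interest using the right HTML tags?

Edit: as some of you noticed, the web page requires login credentials (sorry), so I have uploaded the HTML code here.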
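A note on the output above: pandas.read_html returns a list with one DataFrame per table it finds, so tables[0] is only the first table on the page (here the language menu), not the whole result. The sketch below illustrates this on a small made-up page; SAMPLE_HTML and its contents are assumptions for illustration, not the real page.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the real page: a language menu table
# followed by the data table of interest.
SAMPLE_HTML = """
<html><body>
<table>
  <tr><td>English</td><td>Deutsch</td></tr>
</table>
<table id="resourceBenchmarkTable">
  <tr><th>Resource</th><th>CO2</th></tr>
  <tr><td>Profile A</td><td>1.5</td></tr>
  <tr><td>Profile B</td><td>2.5</td></tr>
</table>
</body></html>
"""

# read_html returns a list of DataFrames, one per <table> element found
tables = pd.read_html(StringIO(SAMPLE_HTML))

# Print the position, shape and column labels of every table
for index, table in enumerate(tables):
    print(index, table.shape, list(table.columns))
```

Looping over the list this way shows which position the table of interest occupies, although relying on a position is fragile if the page layout changes.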

Answer 1

I have taken the HTML you provided as input. If you want to use this code on a URL, just fetch the HTML of that URL before running it.

from io import StringIO

from bs4 import BeautifulSoup
import pandas as pd

# Provide the HTML code of the URL as a string here
# (html_code_of_your_url is a placeholder for the HTML you fetched)
Your_input_html_string = str(html_code_of_your_url)

# Name the parser explicitly so BeautifulSoup does not have to guess one
soup = BeautifulSoup(Your_input_html_string, "html.parser")

# The id of the table to extract is "resourceBenchmarkTable",
# so pull out the HTML of that table alone
extracted_table_html = str(soup.find("table", id="resourceBenchmarkTable"))

# read_html returns a list of DataFrames; wrap the string in StringIO
# (newer pandas versions warn on literal HTML) and take the only table
table_dataframe = pd.read_html(StringIO(extracted_table_html))[0]

print(table_dataframe)

Output: (only the first 5 rows are shown to keep the answer short)
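As a possible shortcut, pandas.read_html accepts an attrs argument that restricts the result to tables whose HTML attributes match, which may remove the need for a separate BeautifulSoup step. This is a minimal sketch, assuming the table keeps the id "resourceBenchmarkTable"; SAMPLE_HTML is a made-up stand-in for the real page.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the real page (not the actual HTML)
SAMPLE_HTML = """
<html><body>
<table>
  <tr><td>English</td><td>Deutsch</td></tr>
</table>
<table id="resourceBenchmarkTable">
  <tr><th>Resource</th><th>CO2</th></tr>
  <tr><td>Profile A</td><td>1.5</td></tr>
  <tr><td>Profile B</td><td>2.5</td></tr>
</table>
</body></html>
"""

# Only tables whose attributes match attrs are returned
matching = pd.read_html(
    StringIO(SAMPLE_HTML),
    attrs={"id": "resourceBenchmarkTable"},
)

# The result is still a list, so take its single element
table_dataframe = matching[0]
print(table_dataframe.head())
```

If no table matches, read_html raises a ValueError, so it may be worth wrapping the call in a try/except when the id is not guaranteed to exist.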

Answer 2

So, I edited the code provided by @KarthickMohanraj to add the first step, which is reading the HTML file saved locally. The final code is as follows:

from io import StringIO

from bs4 import BeautifulSoup
import pandas as pd

# open the HTML file saved locally and read its contents as a string
filepath = 'Aluminium_Profiles_profiles.html'
with open(filepath, 'r', encoding='utf8', errors='ignore') as f:
    s = f.read()

# parse the HTML string with BeautifulSoup, naming the parser explicitly
soup = BeautifulSoup(s, 'html.parser')

# The id of the table to extract is "resourceBenchmarkTable",
# so pull out the HTML of that table alone
extracted_table_html = str(soup.find('table', id='resourceBenchmarkTable'))

# convert the extracted table HTML into a pandas DataFrame
# (read_html returns a list, so take the first element)
table_df = pd.read_html(StringIO(extracted_table_html))[0]
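Both answers rely on knowing that the table id is "resourceBenchmarkTable". When the id is not known in advance, one way to find it is to list every table on the page together with its id and class. The sketch below does this with BeautifulSoup on a made-up page; SAMPLE_HTML, the class name "langmenu" and the table_summaries name are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the saved page (not the actual HTML)
SAMPLE_HTML = """
<html><body>
<table class="langmenu">
  <tr><td>English</td><td>Deutsch</td></tr>
</table>
<table id="resourceBenchmarkTable">
  <tr><th>Resource</th><th>CO2</th></tr>
  <tr><td>Profile A</td><td>1.5</td></tr>
  <tr><td>Profile B</td><td>2.5</td></tr>
</table>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Collect the identifying attributes and row count of every table
table_summaries = []
for table in soup.find_all("table"):
    table_summaries.append(
        {
            "id": table.get("id"),        # None when the table has no id
            "class": table.get("class"),  # list of class names, or None
            "rows": len(table.find_all("tr")),
        }
    )

for summary in table_summaries:
    print(summary)
```

The row count is often enough to tell a small navigation table apart from the data table, after which the id or class can be passed to soup.find as in the code above.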


Tags: python, html-table, html-parsing, pandas