How to scrape specific columns from table with BeautifulSoup and return as pandas dataframe
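I am trying to parse the table with HDI values and load the data into a Pandas DataFrame with the columns: Countries, HDI_score.
I am unable to load the country column with the following code: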
import requests
import pandas as pd
from bs4 import BeautifulSoup

html = requests.get("https://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index")
bsObj = BeautifulSoup(html.text, 'html.parser')
df = pd.DataFrame(columns=['Countries', 'HDI_score'])
# note: `table` is never defined in the code as posted
for row in table.find_all('tr'):
    columns = row.find_all('td')
    if(columns != []):
        countries = columns[1].text.strip()
        hdi_score = columns[2].text.strip()
        df = df.append({'Countries': countries, 'HDI_score': hdi_score}, ignore_index=True)
As a result, instead of country names I get the values from the 'Rank changes over 5 years' column. I tried changing the column index, but it did not help.
You can use pandas with match='Rank' to get the right table, then extract the columns of interest.
import pandas as pd
table = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index', match='Rank')[0]
columns = ['Nation','HDI']
table = table.loc[:, columns].iloc[:, :2]
table.columns = columns
print(table)
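The `match` filter can be tried offline before pointing it at the live page. The following is a minimal sketch, assuming a made-up two-table HTML snippet (the `HTML` string and its contents are hypothetical and only imitate the Wikipedia column names); it shows that `pd.read_html` returns only the tables whose text matches the pattern.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-table page standing in for the Wikipedia article.
HTML = """
<table>
  <tr><th>Year</th><th>Note</th></tr>
  <tr><td>2020</td><td>unrelated table</td></tr>
</table>
<table>
  <tr><th>Rank</th><th>Nation</th><th>HDI</th></tr>
  <tr><td>1</td><td>Norway</td><td>0.957</td></tr>
  <tr><td>2</td><td>Ireland</td><td>0.955</td></tr>
</table>
"""

# match keeps only the tables whose text contains the given string or regex,
# so the first table (no "Rank" anywhere) is dropped.
tables = pd.read_html(StringIO(HTML), match="Rank")

# Pick the columns of interest by name.
table = tables[0][["Nation", "HDI"]]
print(table)
```

Selecting by column name rather than by position keeps the result stable if the page adds or reorders columns, as long as the header names stay the same.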
As per the comments: if you are still using pandas anyway, I think there is little point in involving bs4. See below:
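from io import StringIO

import pandas as pd
import requests  # missing from the snippet as posted
from bs4 import BeautifulSoup as bs

r = requests.get('https://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index')
soup = bs(r.content, 'lxml')
# newer soupsieve releases prefer :-soup-contains over the deprecated :contains
# StringIO avoids the literal-HTML deprecation warning in recent pandas
table = pd.read_html(StringIO(str(soup.select_one('table:has(th:contains("Rank"))'))))[0]
columns = ['Nation','HDI']
table = table.loc[:, columns].iloc[:, :2]
table.columns = columns
print(table)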
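Note: upvote QHarr's answer, since in my opinion using pandas is also the most straightforward solution.
Alternatively, and to answer your question: you can also select the columns with BeautifulSoup alone. Just combine css selectors and stripped_strings.
Example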
import requests
import pandas as pd
from bs4 import BeautifulSoup

html = requests.get("https://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index")
bsObj = BeautifulSoup(html.text, 'html.parser')

pd.DataFrame(
    [list(r.stripped_strings)[-3:-1] for r in bsObj.select('table tr:has(span[data-sort-value])')],
    columns=['Countries', 'HDI_score']
)
Output
Countries | HDI_score |
---|---|
Norway | 0.957 |
Ireland | 0.955 |
Switzerland | 0.955 |
... | ... |
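If you would rather keep the row loop from the question, here is a minimal sketch of one way it might be repaired. It rests on two assumptions that the thread does not confirm: that the country name sits in a `<th>` cell inside each data row (which would explain why indexing only the `td` cells lands on the rank-change column), and that the hypothetical `HTML` snippet below resembles the real markup. It also collects the rows in a list and builds the DataFrame once, because `DataFrame.append` was removed in pandas 2.0.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup: the country name is a <th> inside the data row.
HTML = """
<table>
  <tr><th>Rank</th><th>Change</th><th>Nation</th><th>HDI</th></tr>
  <tr><td>1</td><td>(1)</td><th>Norway</th><td>0.957</td></tr>
  <tr><td>2</td><td>(6)</td><th>Ireland</th><td>0.955</td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

records = []
for row in soup.find("table").find_all("tr"):
    if not row.find("td"):
        continue  # header row: it contains only <th> cells
    # Collect <th> and <td> together so the indices follow the visible columns.
    cells = row.find_all(["th", "td"])
    records.append({
        "Countries": cells[2].get_text(strip=True),
        "HDI_score": cells[3].get_text(strip=True),
    })

# Build the DataFrame once instead of appending row by row.
df = pd.DataFrame(records, columns=["Countries", "HDI_score"])
print(df)
```

On the live page the cell positions may differ, so it is worth printing one row's cells first and adjusting the two indices.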