How to scrape icons from an HTML table using Beautiful Soup
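I'm trying to scrape a table from the markets.ft website which, unfortunately, contains a number of icons (table: 'Lipper Leader Scorecard' - https://markets.ft.com/data/funds/tearsheet/ratings?s=LU0526609390:EUR).
When I use BeautifulSoup, I can get the table, but all of the values are NaN.
Is there a way to scrape the icons in the table and convert them into numbers?
My code is: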
import requests
import pandas as pd
from bs4 import BeautifulSoup

id_list = ['LU0526609390:EUR', 'IE00BHBX0Z19:EUR', 'LU1076093779:EUR', 'LU1116896363:EUR', 'LU1116896876:EUR']
urls = ['https://markets.ft.com/data/funds/tearsheet/ratings?s=' + x for x in id_list]
dfs = []
for url in urls:
    r = requests.get(url).content
    soup = BeautifulSoup(r, 'html.parser')
    # Some funds in the list do not have any data.
    try:
        table = soup.find_all('table')[0]
        print(table)
    except Exception:
        continue
    df = pd.read_html(str(table), index_col=0)[0]
    dfs.append(df)
print(dfs)
Desired output for the fund (LU0526609390):
                 Total return  Consistent return  Preservation  Expense
Overall rating              3                  3             5        5
3 year rating               3                  3             5        5
5 year rating               2                  3             5        5
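Each rating is rendered as an <i> icon whose last class name (for example mod-sprite-lipper-3) encodes the score, so you can use a dictionary to convert the class values to the corresponding integers: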
import requests, bs4
import pandas as pd
from io import StringIO

# Map each icon class to the rating it represents.
options = {
    'mod-sprite-lipper-1': 1,
    'mod-sprite-lipper-2': 2,
    'mod-sprite-lipper-3': 3,
    'mod-sprite-lipper-4': 4,
    'mod-sprite-lipper-5': 5,
}
soup = bs4.BeautifulSoup(requests.get(
    url='https://markets.ft.com/data/funds/tearsheet/ratings?s=LU0526609390:EUR'
).content, 'html.parser').find('table', {'class': 'mod-ui-table'})
header = [x.text.strip() for x in soup.find('thead').find_all('th')]
data = [header] + [
    # The first cell is the row label; the remaining cells hold the icons.
    [x.find('td').text.strip()] + [
        options[e.find('i').get('class')[-1]]
        for e in x.find_all('td')[1:]
    ]
    for x in soup.find('tbody').find_all('tr')
]
df = pd.read_csv(
    StringIO('\n'.join([','.join(str(x) for x in xs) for xs in data])),
    index_col=0,
)
print(df)
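If you want to try the class-to-integer idea without hitting the site, here is a minimal offline sketch. The HTML in SAMPLE_HTML is a hypothetical, trimmed-down stand-in for the real page (the actual markup may differ), and the helper names icon_to_rating and parse_scorecard are made up for illustration. It swaps the dictionary for a regular expression on the class name, assumes the first header cell belongs to the row-label column, and returns None when no table is found, which would cover the funds in the question that have no data.

```python
import re

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical, simplified markup modelled on the class names used above;
# the real page may be structured differently.
SAMPLE_HTML = """
<table class="mod-ui-table">
  <thead>
    <tr><th></th><th>Total return</th><th>Consistent return</th>
        <th>Preservation</th><th>Expense</th></tr>
  </thead>
  <tbody>
    <tr><td>Overall rating</td>
        <td><i class="mod-sprite mod-sprite-lipper-3"></i></td>
        <td><i class="mod-sprite mod-sprite-lipper-3"></i></td>
        <td><i class="mod-sprite mod-sprite-lipper-5"></i></td>
        <td><i class="mod-sprite mod-sprite-lipper-5"></i></td></tr>
    <tr><td>3 year rating</td>
        <td><i class="mod-sprite mod-sprite-lipper-3"></i></td>
        <td><i class="mod-sprite mod-sprite-lipper-3"></i></td>
        <td><i class="mod-sprite mod-sprite-lipper-5"></i></td>
        <td><i class="mod-sprite mod-sprite-lipper-5"></i></td></tr>
    <tr><td>5 year rating</td>
        <td><i class="mod-sprite mod-sprite-lipper-2"></i></td>
        <td><i class="mod-sprite mod-sprite-lipper-3"></i></td>
        <td><i class="mod-sprite mod-sprite-lipper-5"></i></td>
        <td><i class="mod-sprite mod-sprite-lipper-5"></i></td></tr>
  </tbody>
</table>
"""

# Matches a class such as "mod-sprite-lipper-4" and captures the digit.
RATING_CLASS = re.compile(r"mod-sprite-lipper-(\d)$")


def icon_to_rating(cell):
    """Return the rating encoded in the cell's <i> class, or None if absent."""
    icon = cell.find("i")
    if icon is None:
        return None
    for css_class in icon.get("class", []):
        match = RATING_CLASS.match(css_class)
        if match:
            return int(match.group(1))
    return None


def parse_scorecard(html):
    """Build a DataFrame of ratings, or return None when no table is found."""
    table = BeautifulSoup(html, "html.parser").find("table", class_="mod-ui-table")
    if table is None:
        return None
    # Assumption: the first header cell sits above the row labels, so skip it.
    columns = [th.get_text(strip=True) for th in table.find("thead").find_all("th")][1:]
    rows = {}
    for tr in table.find("tbody").find_all("tr"):
        cells = tr.find_all("td")
        if not cells:
            continue
        rows[cells[0].get_text(strip=True)] = [icon_to_rating(td) for td in cells[1:]]
    return pd.DataFrame.from_dict(rows, orient="index", columns=columns)


scorecard = parse_scorecard(SAMPLE_HTML)
print(scorecard)
```

To use this on the live pages, you could pass requests.get(url).text to parse_scorecard inside the loop from the question and skip any fund for which it returns None.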