Scrape Data Off Morningstar via Beautifulsoup
For example, I want to pull all of the values under "Holdings" from https://www.morningstar.com/funds/xnas/aepfx/portfolio. Some of these values are:
- Current portfolio date = Mar 31, 2022
- Holdings = 384
I have tried a few different approaches, but none seem to work.
First, I tried:
soup.find_all("div", class_="sal-dp-value")
but this returns an empty list.
Strangely, I can't even find
<div class="sal-dp-value">Mar 31, 2022</div>
when searching the raw data printed by:
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.morningstar.com/funds/xnas/aepfx/portfolio')
soup = BeautifulSoup(r.text, "html.parser")
soup.html
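
A quick way to confirm that the markup really is missing from the server response (rather than just being missed by the selector) is to search the raw text directly; a minimal check, reusing r from the snippet above:

# If the class name never appears in the downloaded HTML, find_all() has nothing to match;
# that would mean the element is added client-side after the page loads.
print("sal-dp-value" in r.text)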
Less ideal, since I'd prefer to stick with BeautifulSoup, but I also tried via XPath:
import requests
from lxml import html
page = requests.get("https://www.morningstar.com/funds/xnas/aepfx/portfolio").text
holdings = html.fromstring(page).xpath('/html/body/div[2]/div/div/div[2]/div[3]/div/main/div[2]/div/div/div[1]/sal-components/section/div/div/div[3]/sal-components-mip-holdings/div/div/div/div[2]/div[1]/ul/li[1]/div/div[2]')
holdings
This also returns an empty list.
Since the site's content is rendered by JavaScript, bs4 or lxml cannot see it in the raw HTML. Instead, try fetching the fields you are after from the site's backend API:
import requests

# JSON endpoint that serves the page's "Holdings" data for this fund
link = 'https://api-global.morningstar.com/sal-service/v1/fund/portfolio/holding/v2/FOUSA06WRH/data'
headers = {
    'apikey': 'lstzFDEOhfFNMLikKa0am9mgEKLBl49T',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36'
}
payload = {
    'premiumNum': '100',
    'freeNum': '25',
    'languageId': 'en',
    'locale': 'en',
    'clientId': 'MDC',
    'benchmarkId': 'mstarorcat',
    'component': 'sal-components-mip-holdings',
    'version': '3.59.1'
}

with requests.Session() as s:
    s.headers.update(headers)
    resp = s.get(link, params=payload)
    container = resp.json()
    # Fields shown under "Holdings" on the portfolio page
    portfolio_date = container['holdingSummary']['portfolioDate']
    equity_holding = container['numberOfEquityHolding']
    active_share = container['holdingActiveShare']['activeShareValue']
    reported_turnover = container['holdingSummary']['lastTurnover']
    other_holding = container['holdingSummary']['numberOfOtherHolding']
    top_holding = container['holdingSummary']['topHoldingWeighting']
    print(portfolio_date, equity_holding, active_share, reported_turnover, other_holding, top_holding)
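
If the same fields are needed for other funds, the request above can be wrapped in a small helper. This is only a sketch: the get_fund_holdings name and its sec_id parameter are illustrative, and it assumes other funds expose their data at the same endpoint with their own security id in place of FOUSA06WRH (the id used for AEPFX above).

import requests

def get_fund_holdings(sec_id):
    # sec_id is the Morningstar security id that appears in the API URL
    # (e.g. 'FOUSA06WRH' in the answer above).
    link = f'https://api-global.morningstar.com/sal-service/v1/fund/portfolio/holding/v2/{sec_id}/data'
    headers = {
        'apikey': 'lstzFDEOhfFNMLikKa0am9mgEKLBl49T',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36'
    }
    payload = {
        'premiumNum': '100',
        'freeNum': '25',
        'languageId': 'en',
        'locale': 'en',
        'clientId': 'MDC',
        'benchmarkId': 'mstarorcat',
        'component': 'sal-components-mip-holdings',
        'version': '3.59.1'
    }
    with requests.Session() as s:
        s.headers.update(headers)
        resp = s.get(link, params=payload)
        resp.raise_for_status()
        container = resp.json()
    # Same fields as in the answer above, returned as a dict instead of printed.
    return {
        'portfolio_date': container['holdingSummary']['portfolioDate'],
        'equity_holding': container['numberOfEquityHolding'],
        'active_share': container['holdingActiveShare']['activeShareValue'],
        'reported_turnover': container['holdingSummary']['lastTurnover'],
        'other_holding': container['holdingSummary']['numberOfOtherHolding'],
        'top_holding': container['holdingSummary']['topHoldingWeighting'],
    }

print(get_fund_holdings('FOUSA06WRH'))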