Problem parsing list of companies with BeautifulSoup

I can parse the list of S&P 500 companies with the following code:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import xlwings as xw

def get_sp500_info():
    resp = requests.get("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies")
    soup = BeautifulSoup(resp.text, 'lxml')
    stocks_info = []
    tickers = []
    securities = []
    gics_industries = []
    gics_sub_industries = []
    table = soup.find('table', {'class': 'wikitable sortable'})
    
    for row in table.findAll('tr')[1:]:
        ticker = row.findAll('td')[0].text
        security = row.findAll('td')[1].text
        gics_industry = row.findAll('td')[3].text
        gics_sub_industry = row.findAll('td')[4].text
    
        # r"\n" is a literal backslash followed by "n", so it never matched the newline; strip() removes it
        tickers.append(ticker.strip().lower())
        securities.append(security)
        gics_industries.append(gics_industry.lower())
        gics_sub_industries.append(gics_sub_industry.lower())
    
    stocks_info.append(tickers)
    stocks_info.append(securities)
    stocks_info.append(gics_industries)
    stocks_info.append(gics_sub_industries)
    
    stocks_info_df = pd.DataFrame(stocks_info).T
    stocks_info_df.columns=['tickers','security','gics_industry','gics_sub_industry']
    stocks_info_df['seclabels'] = 'SP500'
    return stocks_info_df

def open_in_excel(dataframe):
    xw.view(dataframe)

if __name__ == "__main__":
    open_in_excel(get_sp500_info())

Now I want to parse the list of Russell 3000 companies with essentially the same code, but it does not work.

import requests
from bs4 import BeautifulSoup
import pandas as pd
import xlwings as xw

def get_russel3000_info():
    resp = requests.get("https://www.ishares.com/us/products/239714/ishares-russell-3000-etf#holdings")
    soup = BeautifulSoup(resp.text, "lxml")
    stocks_info = []
    tickers = []
    securities = []
    gics_industries = []
    
    table = soup.find('table', {'class': 'display product-table border-row dataTable no-footer'})

    for row in table.findAll('tr')[1:]:           #Line A
        ticker = row.findAll('td')[0].text
        security = row.findAll('td')[1].text
        gics_industry = row.findAll('td')[2].text

        # r"\n" is a literal backslash followed by "n", so it never matched the newline; strip() removes it
        tickers.append(ticker.strip().lower())
        securities.append(security)
        gics_industries.append(gics_industry.lower())
        
    stocks_info.append(tickers)
    stocks_info.append(securities)
    stocks_info.append(gics_industries)
    
    stocks_info_df = pd.DataFrame(stocks_info).T
    stocks_info_df.columns=['tickers','security','gics_industry']
    stocks_info_df['seclabels'] = 'Russel3000'
    return stocks_info_df

def open_in_excel(dataframe):
    xw.view(dataframe)

if __name__ == "__main__":
    open_in_excel(get_russel3000_info())

I don't understand why it works for the S&P 500 but not for the Russell 3000. At "Line A" I get the following error:

Exception has occurred: AttributeError
'NoneType' object has no attribute 'findAll'

soup.find should not be returning None here. Thanks for any pointers :-)

You can load the tables directly into pandas:

df = pd.read_html("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies")

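pd.read_html returns a list of DataFrames, so you can access the tables on the page with df[0], df[1], and so on. In the case of ishares.com, that particular table is not part of the HTML that requests receives, because it is filled in client-side via JavaScript. That is why soup.find returns None. One solution is to use Selenium for the job: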

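from io import StringIO
import time

import pandas as pd
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

url = "https://www.ishares.com/us/products/239714/ishares-russell-3000-etf#holdings"
# Recent Selenium 4 releases no longer accept the driver path as a positional
# argument; Selenium Manager locates chromedriver automatically.
wd = webdriver.Chrome(options=options)
wd.get(url)
time.sleep(5)  # sleep for a few seconds to allow loading the data
# Newer pandas versions deprecate passing literal HTML, so wrap it in StringIO
df = pd.read_html(StringIO(wd.page_source))  # list of DataFrames, one per table
wd.quit()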

df[7] is the table you are looking for:

   Ticker  Name                  Sector                  Asset Class  Market Value     Weight (%)  Notional Value  Shares       CUSIP      ISIN          SEDOL    Accrual Date
0  AAPL    APPLE INC             Information Technology  Equity       $560,367,328.56  5.16        5.60367e+08     4.38506e+06  037833100  US0378331005  2046251  -
1  MSFT    MICROSOFT CORP        Information Technology  Equity       $482,112,717.24  4.44        4.82113e+08     2.03475e+06  594918104  US5949181045  2588173  -
2  AMZN    AMAZON COM INC        Consumer Discretionary  Equity       $362,479,373.96  3.34        3.62479e+08     115214       023135106  US0231351067  2000019  -
3  FB      FACEBOOK CLASS A INC  Communication           Equity       $172,844,238.24  1.59        1.72844e+08     652464       30303M102  US30303M1027  B7TL820  -
4  GOOGL   ALPHABET INC CLASS A  Communication           Equity       $168,815,957.22  1.55        1.68816e+08     81567        02079K305  US02079K3059  BYVY8G0  -
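
To check whether a table is really missing from the HTML that requests receives (rather than being present under a different class name), it can help to list the class attribute of every table in the served markup. The following is only a minimal sketch using the standard library; the two HTML strings are hypothetical stand-ins for resp.text, not the actual Wikipedia or iShares markup, and table_classes is an illustrative helper name.

```python
from html.parser import HTMLParser


class TableClassCollector(HTMLParser):
    """Record the class attribute of every <table> start tag."""

    def __init__(self):
        super().__init__()
        self.table_classes = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            # attrs is a list of (name, value) pairs
            self.table_classes.append(dict(attrs).get("class", ""))


def table_classes(html):
    """Return the class strings of all tables found in the given HTML."""
    collector = TableClassCollector()
    collector.feed(html)
    return collector.table_classes


# Hypothetical stand-ins for resp.text: one page with a server-rendered
# table, and one page whose table would only be created later by a script.
static_page = (
    '<html><body><table class="wikitable sortable">'
    "<tr><td>MMM</td></tr></table></body></html>"
)
script_page = (
    '<html><body><div id="holdings"></div>'
    '<script src="app.js"></script></body></html>'
)

print(table_classes(static_page))  # ['wikitable sortable']
print(table_classes(script_page))  # []
```

If table_classes(resp.text) comes back empty, or does not contain the class string passed to soup.find, then soup.find returns None and the subsequent .findAll call raises the AttributeError from the question.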

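A better solution is to load the JSON file directly. As you can see when inspecting the site in Firefox or Chrome, the table data is loaded from this JSON URL: https://www.ishares.com/us/products/239714/ishares-russell-3000-etf/1467271812596.ajax?tab=all&fileType=json. The advantage of loading it into pandas is that you get all 2866 entries into the dataframe in one go. We cannot load it into pandas directly because the file starts with a UTF-8 BOM, but this works: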

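import json

import pandas as pd
import requests

url = "https://www.ishares.com/us/products/239714/ishares-russell-3000-etf/1467271812596.ajax?tab=all&fileType=json"
r = requests.get(url)
r.raise_for_status()  # fail early on an HTTP error
# Decode with utf-8-sig to strip the BOM; the result is named data so it does not shadow the json module
data = json.loads(r.content.decode('utf-8-sig'))
df = pd.DataFrame(data['aaData'])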

Output:

   0      1                     2                       3       4                                                    5                                    6                                                   7                                            8          9             10       11                                       12             13      14   15  16   17
0  AAPL   APPLE INC             Information Technology  Equity  {'display': '$560,367,328.56', 'raw': 560367328.56}  {'display': '5.16', 'raw': 5.15741}  {'display': '560,367,328.56', 'raw': 560367328.56}  {'display': '4,385,064.00', 'raw': 4385064}  037833100  US0378331005  2046251  {'display': '127.79', 'raw': 127.79}     United States  NASDAQ  USD  1   USD  -
1  MSFT   MICROSOFT CORP        Information Technology  Equity  {'display': '$482,112,717.24', 'raw': 482112717.24}  {'display': '4.44', 'raw': 4.43718}  {'display': '482,112,717.24', 'raw': 482112717.24}  {'display': '2,034,746.00', 'raw': 2034746}  594918104  US5949181045  2588173  {'display': '236.94', 'raw': 236.94}     United States  NASDAQ  USD  1   USD  -
2  AMZN   AMAZON COM INC        Consumer Discretionary  Equity  {'display': '$362,479,373.96', 'raw': 362479373.96}  {'display': '3.34', 'raw': 3.33612}  {'display': '362,479,373.96', 'raw': 362479373.96}  {'display': '115,214.00', 'raw': 115214}     023135106  US0231351067  2000019  {'display': '3,146.14', 'raw': 3146.14}  United States  NASDAQ  USD  1   USD  -
3  FB     FACEBOOK CLASS A INC  Communication           Equity  {'display': '$172,844,238.24', 'raw': 172844238.24}  {'display': '1.59', 'raw': 1.59079}  {'display': '172,844,238.24', 'raw': 172844238.24}  {'display': '652,464.00', 'raw': 652464}     30303M102  US30303M1027  B7TL820  {'display': '264.91', 'raw': 264.91}     United States  NASDAQ  USD  1   USD  -
4  GOOGL  ALPHABET INC CLASS A  Communication           Equity  {'display': '$168,815,957.22', 'raw': 168815957.22}  {'display': '1.55', 'raw': 1.55372}  {'display': '168,815,957.22', 'raw': 168815957.22}  {'display': '81,567.00', 'raw': 81567}       02079K305  US02079K3059  BYVY8G0  {'display': '2,069.66', 'raw': 2069.66}  United States  NASDAQ  USD  1   USD  -
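
As the output shows, several cells are dicts holding a formatted 'display' string and a numeric 'raw' value. The sketch below illustrates both the BOM handling and one possible way to keep only the raw values. It runs offline on a stand-in payload: a UTF-8 BOM followed by a shortened AAPL row taken from the output above (real rows have 18 cells). The helper name raw_value is hypothetical.

```python
import json

# Stand-in for r.content: a UTF-8 BOM followed by JSON shaped like the
# response above (one shortened AAPL row; real rows have 18 cells).
payload = b"\xef\xbb\xbf" + json.dumps({
    "aaData": [[
        "AAPL", "APPLE INC", "Information Technology", "Equity",
        {"display": "5.16", "raw": 5.15741},
        {"display": "4,385,064.00", "raw": 4385064},
        "037833100",
    ]]
}).encode("utf-8")

# Plain UTF-8 decoding keeps the BOM as U+FEFF, which json.loads rejects.
try:
    json.loads(payload.decode("utf-8"))
    bom_rejected = False
except json.JSONDecodeError:
    bom_rejected = True

# The "utf-8-sig" codec strips the BOM before parsing.
data = json.loads(payload.decode("utf-8-sig"))


def raw_value(cell):
    """Return cell['raw'] for display/raw dicts, otherwise the cell itself."""
    if isinstance(cell, dict) and "raw" in cell:
        return cell["raw"]
    return cell


rows = [[raw_value(cell) for cell in row] for row in data["aaData"]]

print(bom_rejected)  # True
print(rows[0])
# ['AAPL', 'APPLE INC', 'Information Technology', 'Equity', 5.15741, 4385064, '037833100']
```

Passing rows instead of data['aaData'] to pd.DataFrame should then give plain numeric columns. The aaData rows shown above carry no column names, so those would still have to be assigned by hand.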