Problem parsing list of companies with BeautifulSoup
I can parse the S&P 500 company list with the following code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import xlwings as xw

def get_sp500_info():
    resp = requests.get("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies")
    soup = BeautifulSoup(resp.text, 'lxml')
    stocks_info = []
    tickers = []
    securities = []
    gics_industries = []
    gics_sub_industries = []
    table = soup.find('table', {'class': 'wikitable sortable'})
    for row in table.findAll('tr')[1:]:
        ticker = row.findAll('td')[0].text
        security = row.findAll('td')[1].text
        gics_industry = row.findAll('td')[3].text
        gics_sub_industry = row.findAll('td')[4].text
        tickers.append(ticker.lower().replace(r"\n", " "))
        securities.append(security)
        gics_industries.append(gics_industry.lower())
        gics_sub_industries.append(gics_sub_industry.lower())
    stocks_info.append(tickers)
    stocks_info.append(securities)
    stocks_info.append(gics_industries)
    stocks_info.append(gics_sub_industries)
    stocks_info_df = pd.DataFrame(stocks_info).T
    stocks_info_df.columns=['tickers','security','gics_industry','gics_sub_industry']
    stocks_info_df['seclabels'] = 'SP500'
    return stocks_info_df

def open_in_excel(dataframe):
    xw.view(dataframe)

if __name__ == "__main__":
    open_in_excel(get_sp500_info())
Now I want to parse the Russell 3000 company list with essentially the same code, and it does not work.
import requests
from bs4 import BeautifulSoup
import pandas as pd
import xlwings as xw

def get_russel3000_info():
    resp = requests.get("https://www.ishares.com/us/products/239714/ishares-russell-3000-etf#holdings")
    soup = BeautifulSoup(resp.text, "lxml")
    stocks_info = []
    tickers = []
    securities = []
    gics_industries = []
    table = soup.find('table', {'class': 'display product-table border-row dataTable no-footer'})
    for row in table.findAll('tr')[1:]: #Line A
        ticker = row.findAll('td')[0].text
        security = row.findAll('td')[1].text
        gics_industry = row.findAll('td')[2].text
        tickers.append(ticker.lower().replace(r"\n", " "))
        securities.append(security)
        gics_industries.append(gics_industry.lower())
    stocks_info.append(tickers)
    stocks_info.append(securities)
    stocks_info.append(gics_industries)
    stocks_info_df = pd.DataFrame(stocks_info).T
    stocks_info_df.columns=['tickers','security','gics_industry']
    stocks_info_df['seclabels'] = 'Russel3000'
    return stocks_info_df

def open_in_excel(dataframe):
    xw.view(dataframe)

if __name__ == "__main__":
    open_in_excel(get_russel3000_info())
I don't understand why it works for the S&P 500 but not for the Russell 3000.
At "Line A" I get the following error:
Exception has occurred: AttributeError
'NoneType' object has no attribute 'findAll'
It should not be returning "None".
Thanks for any pointers :-)
You can load the tables directly into pandas:
df = pd.read_html("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies")
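You can access the tables on the page with df[0], df[1], and so on. In the case of ishares.com, the table you want is not in the HTML that requests downloads, because it is filled in client-side by JavaScript. One solution is to use Selenium to do the job: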
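from io import StringIO
import time

import pandas as pd
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

url = "https://www.ishares.com/us/products/239714/ishares-russell-3000-etf#holdings"
# Recent Selenium 4 releases no longer accept the driver path as a positional
# argument and locate chromedriver themselves.
wd = webdriver.Chrome(options=options)
wd.get(url)
time.sleep(5)  # sleep for a few seconds to allow loading the data
# read_html returns a list of DataFrames, one per <table> in the rendered page
df = pd.read_html(StringIO(wd.page_source))
wd.quit()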
df[7] is the table you are looking for:
| | Ticker | Name | Sector | Asset Class | Market Value | Weight (%) | Notional Value | Shares | CUSIP | ISIN | SEDOL | Accrual Date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAPL | APPLE INC | Information Technology | Equity | $560,367,328.56 | 5.16 | 5.60367e+08 | 4.38506e+06 | 037833100 | US0378331005 | 2046251 | - |
| 1 | MSFT | MICROSOFT CORP | Information Technology | Equity | $482,112,717.24 | 4.44 | 4.82113e+08 | 2.03475e+06 | 594918104 | US5949181045 | 2588173 | - |
| 2 | AMZN | AMAZON COM INC | Consumer Discretionary | Equity | $362,479,373.96 | 3.34 | 3.62479e+08 | 115214 | 023135106 | US0231351067 | 2000019 | - |
| 3 | FB | FACEBOOK CLASS A INC | Communication | Equity | $172,844,238.24 | 1.59 | 1.72844e+08 | 652464 | 30303M102 | US30303M1027 | B7TL820 | - |
| 4 | GOOGL | ALPHABET INC CLASS A | Communication | Equity | $168,815,957.22 | 1.55 | 1.68816e+08 | 81567 | 02079K305 | US02079K3059 | BYVY8G0 | - |
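To see why the original soup.find(...) came back as None, it can help to reproduce the situation offline. The following is only a minimal sketch: STATIC_HTML is a hypothetical stand-in for what requests receives (an empty table shell, which is an assumption about this page), find_table is a hypothetical helper, and the idea that classes such as dataTable and no-footer are added later by the page's script is likewise an assumption. It shows that searching for the full class string seen in the browser finds nothing, and that checking for None gives a clearer error than the AttributeError from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the HTML that requests downloads: the table is an
# empty shell, and its rows (and extra CSS classes) would be added by a script.
STATIC_HTML = """
<html><body>
  <table class="display product-table border-row"></table>
  <script>/* rows are fetched and inserted by JavaScript */</script>
</body></html>
"""


def find_table(html, class_attr):
    """Return the first <table> matching class_attr, or raise a clear error."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": class_attr})
    if table is None:
        raise LookupError(f"no <table> with class {class_attr!r} in the fetched HTML")
    return table


# A single class that is present in the static HTML matches, but the shell has no rows.
rows_found = len(find_table(STATIC_HTML, "product-table").find_all("tr"))

# The full class string copied from the browser does not match the static HTML.
try:
    find_table(STATIC_HTML, "display product-table border-row dataTable no-footer")
    missing = False
except LookupError as exc:
    missing = True
    print(exc)
```

Printing resp.text, or searching it for a ticker such as AAPL, is a quick way to check whether the data is in the downloaded HTML at all.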
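A better solution is to load the JSON file directly. As you can see when inspecting the site in Firefox or Chrome, the table data is loaded from this JSON URL: https://www.ishares.com/us/products/239714/ishares-russell-3000-etf/1467271812596.ajax?tab=all&fileType=json
The advantage of loading it into pandas is that you get all 2866 entries in the DataFrame at once. We cannot load it into pandas directly, because the file contains a UTF-8 BOM header, but this works: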
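import json

import pandas as pd
import requests

url = "https://www.ishares.com/us/products/239714/ishares-russell-3000-etf/1467271812596.ajax?tab=all&fileType=json"
r = requests.get(url)
# The response starts with a UTF-8 BOM; 'utf-8-sig' strips it while decoding.
# The result is named 'data' so that it does not shadow the json module.
data = json.loads(r.content.decode('utf-8-sig'))
df = pd.DataFrame(data['aaData'])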
Output:
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAPL | APPLE INC | Information Technology | Equity | {'display': '$560,367,328.56', 'raw': 560367328.56} | {'display': '5.16', 'raw': 5.15741} | {'display': '560,367,328.56', 'raw': 560367328.56} | {'display': '4,385,064.00', 'raw': 4385064} | 037833100 | US0378331005 | 2046251 | {'display': '127.79', 'raw': 127.79} | United States | NASDAQ | USD | 1 | USD | - |
| 1 | MSFT | MICROSOFT CORP | Information Technology | Equity | {'display': '$482,112,717.24', 'raw': 482112717.24} | {'display': '4.44', 'raw': 4.43718} | {'display': '482,112,717.24', 'raw': 482112717.24} | {'display': '2,034,746.00', 'raw': 2034746} | 594918104 | US5949181045 | 2588173 | {'display': '236.94', 'raw': 236.94} | United States | NASDAQ | USD | 1 | USD | - |
| 2 | AMZN | AMAZON COM INC | Consumer Discretionary | Equity | {'display': '$362,479,373.96', 'raw': 362479373.96} | {'display': '3.34', 'raw': 3.33612} | {'display': '362,479,373.96', 'raw': 362479373.96} | {'display': '115,214.00', 'raw': 115214} | 023135106 | US0231351067 | 2000019 | {'display': '3,146.14', 'raw': 3146.14} | United States | NASDAQ | USD | 1 | USD | - |
| 3 | FB | FACEBOOK CLASS A INC | Communication | Equity | {'display': '$172,844,238.24', 'raw': 172844238.24} | {'display': '1.59', 'raw': 1.59079} | {'display': '172,844,238.24', 'raw': 172844238.24} | {'display': '652,464.00', 'raw': 652464} | 30303M102 | US30303M1027 | B7TL820 | {'display': '264.91', 'raw': 264.91} | United States | NASDAQ | USD | 1 | USD | - |
| 4 | GOOGL | ALPHABET INC CLASS A | Communication | Equity | {'display': '$168,815,957.22', 'raw': 168815957.22} | {'display': '1.55', 'raw': 1.55372} | {'display': '168,815,957.22', 'raw': 168815957.22} | {'display': '81,567.00', 'raw': 81567} | 02079K305 | US02079K3059 | BYVY8G0 | {'display': '2,069.66', 'raw': 2069.66} | United States | NASDAQ | USD | 1 | USD | - |
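Several columns in that output still hold {'display': ..., 'raw': ...} dicts, so a further cleaning step may be useful. The following is a minimal sketch, not part of the original answer: RAW is a small hypothetical payload shaped like the response above (including the BOM), and parse_holdings is a hypothetical helper name. It shows why decoding as plain UTF-8 fails, and how to keep only the raw value of each nested cell.

```python
import json

# Hypothetical payload shaped like the response above, prefixed with a UTF-8 BOM.
RAW = b"\xef\xbb\xbf" + json.dumps({"aaData": [
    ["AAPL", "APPLE INC", "Information Technology", "Equity",
     {"display": "$560,367,328.56", "raw": 560367328.56},
     {"display": "5.16", "raw": 5.15741}],
]}).encode("utf-8")


def parse_holdings(raw_bytes):
    """Decode BOM-prefixed JSON bytes and replace display/raw dicts by their raw value."""
    payload = json.loads(raw_bytes.decode("utf-8-sig"))
    return [
        [cell["raw"] if isinstance(cell, dict) and "raw" in cell else cell
         for cell in row]
        for row in payload["aaData"]
    ]


# Decoding as plain UTF-8 keeps the BOM character, which json.loads rejects for str input.
try:
    json.loads(RAW.decode("utf-8"))
    bom_rejected = False
except json.JSONDecodeError:
    bom_rejected = True

rows = parse_holdings(RAW)
print(rows[0])
```

The cleaned rows can then be passed to pd.DataFrame(rows) in place of the unprocessed aaData list, so that the numeric columns contain plain numbers.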