Read a NASDAQ HTML table into a DataFrame
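I used this code to get the latest list of companies traded on NASDAQ, but I would like the result in a DataFrame rather than a plain list that includes all the other information I may not need.
Any ideas how to achieve this? Thanks.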
# Parse the latest NASDAQ companies
from bs4 import BeautifulSoup
import requests

# Fetch the company screener page, sorted by market cap
r = requests.get('https://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NASDAQ&sortname=marketcap&sorttype=1&pagesize=4000')
data = r.text
soup = BeautifulSoup(data, "html.parser")
table = soup.find("table", {"id": "CompanylistResults"})
# Print the text of every cell, one per line
for row in table.findAll("tr"):
    for cell in row("td"):
        print(cell.get_text().strip())
It looks like you are after the aptly named read_html, though you will need to experiment until you get what you want. In your case:
>>> import pandas as pd
>>> df=pd.read_html(table.prettify(),flavor='bs4')[0]
>>> df.columns = [c.strip() for c in df.columns]
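See the output below.
The read_html line does the work; the line after it just strips the annoying spaces and newlines from the header. There also seems to be a hidden ADR TSO column that looks useless, so you can drop it if you don't know what it is. It may also make sense to drop every second row (index 1, 3, ...), since each one is just a continuation of the row before it and, as far as I can tell, holds only useless links. In one go: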
>>> df = df.drop(['ADR TSO'], axis=1) #Drop useless column
>>> df1= df[::2] #To get rid of even rows
>>> df2= df[~df['Name'].str.contains('Stock Quote')].head() #By string filtration if we are not sure about the odd/even thing
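The steps above can be tried without hitting the site. The following is a minimal sketch, assuming a small hand-written table that imitates the layout described in this answer (a company row followed by a one-cell link row, plus an empty ADR TSO column). The sample HTML is hypothetical, not real NASDAQ markup, and read_html needs an HTML parser installed (lxml, or bs4 with html5lib).

```python
from io import StringIO

import pandas as pd

# Hypothetical sample imitating the layout described above:
# each company row is followed by a one-cell row of links.
html = """
<table id="CompanylistResults">
  <thead>
    <tr><th> Name </th><th> Symbol </th><th> ADR TSO </th><th> Country </th></tr>
  </thead>
  <tbody>
    <tr><td>Amazon.com, Inc.</td><td>AMZN</td><td></td><td>United States</td></tr>
    <tr><td>AMZN Stock Quote AMZN Ratings AMZN Stock Report</td></tr>
    <tr><td>Microsoft Corporation</td><td>MSFT</td><td></td><td>United States</td></tr>
    <tr><td>MSFT Stock Quote MSFT Ratings MSFT Stock Report</td></tr>
  </tbody>
</table>
"""

# Wrapping the markup in StringIO avoids the deprecation warning that
# recent pandas versions raise for literal HTML strings.
df = pd.read_html(StringIO(html))[0]

# Strip stray whitespace from the header (harmless if already clean).
df.columns = [c.strip() for c in df.columns]

# Drop the empty column, then remove the link rows in two ways.
df = df.drop(['ADR TSO'], axis=1)
by_position = df[::2]                                     # keep rows 0, 2, ...
by_text = df[~df['Name'].str.contains('Stock Quote')]     # keep non-link rows

print(by_position)
```

Filtering by text is the safer of the two if the company and link rows ever stop alternating strictly; on this sample both give the same result.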
Output of the original head, shown for reference:
>>> df.head()
                                              Name  Symbol  Market Cap  \
0                                 Amazon.com, Inc.    AMZN       2.18B
1  AMZN Stock Quote AMZN Ratings AMZN Stock Report     NaN         NaN
2                            Microsoft Corporation    MSFT       9.12B
3  MSFT Stock Quote MSFT Ratings MSFT Stock Report     NaN         NaN
4                                    Alphabet Inc.   GOOGL        0.3B

   ADR TSO        Country  IPO Year  \
0      NaN  United States      1997
1      NaN            NaN       NaN
2      NaN  United States      1986
3      NaN            NaN       NaN
4      NaN  United States       n/a

                                         Subsector
0                   Catalog/Specialty Distribution
1                                              NaN
2          Computer Software: Prepackaged Software
3                                              NaN
4  Computer Software: Programming, Data Processing
Output of df.head() after cleanup:
                    Name  Symbol  Market Cap        Country  IPO Year  \
0       Amazon.com, Inc.    AMZN       2.18B  United States      1997
2  Microsoft Corporation    MSFT       9.12B  United States      1986
4          Alphabet Inc.   GOOGL        0.3B  United States       n/a
6          Alphabet Inc.    GOOG       5.24B  United States      2004
8             Apple Inc.    AAPL        0.3B  United States      1980

                                         Subsector
0                   Catalog/Specialty Distribution
2          Computer Software: Prepackaged Software
4  Computer Software: Programming, Data Processing
6  Computer Software: Programming, Data Processing
8                           Computer Manufacturing
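The cleaned output still shows IPO Year holding the text n/a for some companies. If a numeric column is wanted, one possible follow-up, which is not part of the original answer, is pd.to_numeric with errors="coerce". The sketch below builds a small hypothetical frame from the values shown above rather than reading the live page.

```python
import pandas as pd

# Hypothetical frame built from a few values in the output above.
df = pd.DataFrame({
    "Symbol": ["AMZN", "MSFT", "GOOGL"],
    "IPO Year": ["1997", "1986", "n/a"],
})

# errors="coerce" turns anything that cannot be parsed ("n/a") into NaN,
# so the column becomes float instead of object.
df["IPO Year"] = pd.to_numeric(df["IPO Year"], errors="coerce")

print(df)
```

The column ends up as floats because NaN has no integer representation in a plain NumPy integer column.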