How to deal with multi-level column names downloaded with yfinance
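I have a list of tickers (tickerStrings) that I have to download all at once. When I try to use pandas' read_csv, it doesn't read the CSV file in the way it does when I download the data from yfinance.
I usually access my data by ticker like this: data['AAPL'] or data['AAPL'].Close, but when I read the data from the CSV file it does not let me do that.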
if path.exists(data_file):
    data = pd.read_csv(data_file, low_memory=False)
    data = pd.DataFrame(data)
    print(data.head())
else:
    data = yf.download(tickerStrings, group_by="Ticker", period=prd, interval=intv)
    data.to_csv(data_file)
Here is the printed output:
Unnamed: 0 OLN OLN.1 OLN.2 OLN.3 ... W.1 W.2 W.3 W.4 W.5
0 NaN Open High Low Close ... High Low Close Adj Close Volume
1 Datetime NaN NaN NaN NaN ... NaN NaN NaN NaN NaN
2 2020-06-25 09:30:00-04:00 11.1899995803833 11.220000267028809 11.010000228881836 11.079999923706055 ... 201.2899932861328 197.3000030517578 197.36000061035156 197.36000061035156 112156
3 2020-06-25 09:45:00-04:00 11.130000114440918 11.260000228881836 11.100000381469727 11.15999984741211 ... 200.48570251464844 196.47999572753906 199.74000549316406 199.74000549316406 83943
4 2020-06-25 10:00:00-04:00 11.170000076293945 11.220000267028809 11.119999885559082 11.170000076293945 ... 200.49000549316406 198.19000244140625 200.4149932861328 200.4149932861328 88771
The error I get when trying to access the data:
Traceback (most recent call last):
File "getdata.py", line 49, in processData
avg = data[x].Close.mean()
AttributeError: 'Series' object has no attribute 'Close'
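Download all tickers into a single dataframe with single-level column headers
Option 1
- When downloading data for a single ticker, the returned dataframe has single-level column names, but no ticker column.
- This downloads the data for each ticker, adds a ticker column, and creates a single dataframe from all the desired tickers.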
import yfinance as yf
import pandas as pd

tickerStrings = ['AAPL', 'MSFT']
df_list = list()
for ticker in tickerStrings:
    data = yf.download(ticker, group_by="Ticker", period='2d')
    data['ticker'] = ticker  # add this column because the dataframe doesn't contain a column with the ticker
    df_list.append(data)

# combine all dataframes into a single dataframe
df = pd.concat(df_list)

# save to csv
df.to_csv('ticker.csv')
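Once the combined dataframe is in this long format, per-ticker access goes through the ticker column rather than data['AAPL']. A minimal sketch of that access pattern, using small made-up frames in place of the yf.download results so it runs offline (the values and the names fake_downloads and long_df are illustrative only):

```python
import pandas as pd

dates = pd.DatetimeIndex(['2020-06-25', '2020-06-26'], name='Date')

# Made-up stand-ins for the per-ticker frames returned by yf.download.
fake_downloads = {
    'AAPL': pd.DataFrame({'Close': [10.0, 12.0], 'Volume': [100, 200]}, index=dates),
    'MSFT': pd.DataFrame({'Close': [20.0, 24.0], 'Volume': [300, 400]}, index=dates),
}

frames = []
for ticker, frame in fake_downloads.items():
    frame = frame.copy()
    frame['ticker'] = ticker  # same idea as data['ticker'] = ticker above
    frames.append(frame)

long_df = pd.concat(frames)

# Rows for one ticker, selected with a boolean mask on the ticker column.
aapl_close = long_df.loc[long_df['ticker'] == 'AAPL', 'Close']

# Mean close for every ticker at once.
mean_close = long_df.groupby('ticker')['Close'].mean()
```

The groupby call is one way to replace a loop such as data[x].Close.mean() over each ticker.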
Option 2
- Download all the tickers and unstack the levels
- group_by='Ticker' puts the ticker at level=0 of the column names
tickerStrings = ['AAPL', 'MSFT']
df = yf.download(tickerStrings, group_by='Ticker', period='2d')
df = df.stack(level=0).rename_axis(['Date', 'Ticker']).reset_index(level=1)
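Read a yfinance csv already stored with multi-level column names
- If you wish to keep and read in a file with a multi-level column index, use the following code, which will return the dataframe to its original form.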
df = pd.read_csv('test.csv', header=[0, 1])
df.drop([0], axis=0, inplace=True) # drop this row because it only has one column with Date in it
df[('Unnamed: 0_level_0', 'Unnamed: 0_level_1')] = pd.to_datetime(df[('Unnamed: 0_level_0', 'Unnamed: 0_level_1')], format='%Y-%m-%d') # convert the first column to a datetime
df.set_index(('Unnamed: 0_level_0', 'Unnamed: 0_level_1'), inplace=True) # set the first column as the index
df.index.name = None # clear the index name
- The issue is that tickerStrings is a list of tickers, which results in a final dataframe with multi-level column names
AAPL MSFT
Open High Low Close Adj Close Volume Open High Low Close Adj Close Volume
Date
1980-12-12 0.513393 0.515625 0.513393 0.513393 0.405683 117258400 NaN NaN NaN NaN NaN NaN
1980-12-15 0.488839 0.488839 0.486607 0.486607 0.384517 43971200 NaN NaN NaN NaN NaN NaN
1980-12-16 0.453125 0.453125 0.450893 0.450893 0.356296 26432000 NaN NaN NaN NaN NaN NaN
1980-12-17 0.462054 0.464286 0.462054 0.462054 0.365115 21610400 NaN NaN NaN NaN NaN NaN
1980-12-18 0.475446 0.477679 0.475446 0.475446 0.375698 18362400 NaN NaN NaN NaN NaN NaN
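- When this is saved to a csv, it looks like the following example, and reading it back with a plain read_csv results in a dataframe like the one you're having the issue with.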
,AAPL,AAPL,AAPL,AAPL,AAPL,AAPL,MSFT,MSFT,MSFT,MSFT,MSFT,MSFT
,Open,High,Low,Close,Adj Close,Volume,Open,High,Low,Close,Adj Close,Volume
Date,,,,,,,,,,,,
1980-12-12,0.5133928656578064,0.515625,0.5133928656578064,0.5133928656578064,0.40568336844444275,117258400,,,,,,
1980-12-15,0.4888392984867096,0.4888392984867096,0.4866071343421936,0.4866071343421936,0.3845173120498657,43971200,,,,,,
1980-12-16,0.453125,0.453125,0.4508928656578064,0.4508928656578064,0.3562958240509033,26432000,,,,,,
Flatten multi-level columns to a single level and add a ticker column
- If the ticker symbol is level=0 (top) of the column names
  - This is the case when using group_by='Ticker'
df.stack(level=0).rename_axis(['Date', 'Ticker']).reset_index(level=1)
- If the ticker symbol is level=1 (bottom) of the column names
df.stack(level=1).rename_axis(['Date', 'Ticker']).reset_index(level=1)
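To make the two cases concrete, here is a small offline sketch that applies both stack calls to a made-up frame. The values and the names by_ticker, by_field, flat0 and flat1 are assumptions for illustration; a real yfinance download has more columns. Depending on the pandas version, stack may emit a FutureWarning about its new implementation.

```python
import pandas as pd

dates = pd.DatetimeIndex(['2020-06-25', '2020-06-26'], name='Date')

# Ticker on level 0 of the columns, as with group_by='Ticker' (made-up values).
by_ticker = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]],
    index=dates,
    columns=pd.MultiIndex.from_product([['AAPL', 'MSFT'], ['Open', 'Close']]),
)
flat0 = by_ticker.stack(level=0).rename_axis(['Date', 'Ticker']).reset_index(level=1)

# Ticker on level 1: the same data with the two column levels swapped.
by_field = by_ticker.swaplevel(axis=1)
flat1 = by_field.stack(level=1).rename_axis(['Date', 'Ticker']).reset_index(level=1)

# Both results have one row per (date, ticker) and a plain 'Ticker' column.
aapl_close = flat0.loc[flat0['Ticker'] == 'AAPL', 'Close']
```

In both cases the level passed to stack is whichever column level holds the ticker symbols.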
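Download each ticker and save it to a separate file
- I would recommend downloading and saving each ticker individually, which would look something like the following: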
import yfinance as yf
import pandas as pd

tickerStrings = ['AAPL', 'MSFT']
for ticker in tickerStrings:
    # prd and intv are the period and interval variables from the question
    data = yf.download(ticker, group_by="Ticker", period=prd, interval=intv)
    data['ticker'] = ticker  # add this column because the dataframe doesn't contain a column with the ticker
    data.to_csv(f'ticker_{ticker}.csv')  # ticker_AAPL.csv for example
- data looks like
Open High Low Close Adj Close Volume ticker
Date
1986-03-13 0.088542 0.101562 0.088542 0.097222 0.062205 1031788800 MSFT
1986-03-14 0.097222 0.102431 0.097222 0.100694 0.064427 308160000 MSFT
1986-03-17 0.100694 0.103299 0.100694 0.102431 0.065537 133171200 MSFT
1986-03-18 0.102431 0.103299 0.098958 0.099826 0.063871 67766400 MSFT
1986-03-19 0.099826 0.100694 0.097222 0.098090 0.062760 47894400 MSFT
- The resulting csv looks like
Date,Open,High,Low,Close,Adj Close,Volume,ticker
1986-03-13,0.0885416641831398,0.1015625,0.0885416641831398,0.0972222238779068,0.0622050017118454,1031788800,MSFT
1986-03-14,0.0972222238779068,0.1024305522441864,0.0972222238779068,0.1006944477558136,0.06442664563655853,308160000,MSFT
1986-03-17,0.1006944477558136,0.1032986119389534,0.1006944477558136,0.1024305522441864,0.0655374601483345,133171200,MSFT
1986-03-18,0.1024305522441864,0.1032986119389534,0.0989583358168602,0.0998263880610466,0.06387123465538025,67766400,MSFT
1986-03-19,0.0998263880610466,0.1006944477558136,0.0972222238779068,0.0980902761220932,0.06276042759418488,47894400,MSFT
Read in the multiple files saved in the previous section and create a single dataframe
import pandas as pd
from pathlib import Path
# set the path to the files
p = Path('c:/path_to_files')
# find the files; this is a generator, not a list
files = p.glob('ticker_*.csv')
# read the files into a dataframe
df = pd.concat([pd.read_csv(file) for file in files])
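The snippet above leaves Date as an ordinary string column. A hedged variant that restores Date as a datetime index while reading is sketched below; it writes two small made-up files into a temporary directory so it can run anywhere (the file contents and the names folder and combined are illustrative only).

```python
import tempfile
from pathlib import Path

import pandas as pd

dates = pd.DatetimeIndex(['2020-06-25', '2020-06-26'], name='Date')

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)

    # Write one small made-up file per ticker, mirroring ticker_{ticker}.csv.
    for ticker, close in {'AAPL': [10.0, 12.0], 'MSFT': [20.0, 24.0]}.items():
        frame = pd.DataFrame({'Close': close, 'ticker': ticker}, index=dates)
        frame.to_csv(folder / f'ticker_{ticker}.csv')

    # sorted() turns the glob generator into a list with a stable order.
    files = sorted(folder.glob('ticker_*.csv'))

    # index_col and parse_dates rebuild the datetime index on read.
    combined = pd.concat(
        [pd.read_csv(file, index_col='Date', parse_dates=['Date']) for file in files]
    )
```

With the index restored, date-based selection such as combined.loc['2020-06-25'] works across all tickers.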
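Another option that keeps a pandas dataframe but drops the unneeded data is to change the column index from a multi-index to a single index. Since you only care about the 'Close' column, the first step is to discard the other columns: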
df = yf.download(...)
df = df[['Close']]
That's fine, but each column still has a multi-index label that looks like (Close/AAPL) or (Close/MSFT), etc. What you really want is just the ticker.
df.columns = [col[1] for col in df.columns]
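Now, if you want to split the dataframe into a separate object for each column, you can do that with a list comprehension (each element is a pandas Series).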
separated = [df.iloc[:,i] for i in range(len(df.columns))]
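The steps above can be tried offline on a made-up frame with the price field on the top column level. The values and the names close, close_alt and separated_frames are assumptions for illustration, and the sketch also shows two small variations: selecting with a single label, and keeping each piece a one-column DataFrame.

```python
import pandas as pd

dates = pd.DatetimeIndex(['2020-06-25', '2020-06-26'], name='Date')

# Stand-in for yf.download(...) with the price field on level 0 (made-up values).
df = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]],
    index=dates,
    columns=pd.MultiIndex.from_product([['Close', 'Open'], ['AAPL', 'MSFT']]),
)

# Keep only the Close columns, then reduce each label to the ticker.
close = df[['Close']].copy()
close.columns = [col[1] for col in close.columns]

# Selecting with a single label drops the top level in one step.
close_alt = df['Close']

# Double brackets keep each piece a one-column DataFrame rather than a Series.
separated_frames = {ticker: close[[ticker]] for ticker in close.columns}
```

Keying the pieces by ticker in a dict, rather than by position in a list, keeps the ticker name attached to its data.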
Turn it into a dict of d[ticker] = df:
df = yf.download(tickers, group_by="ticker")
d = {idx: gp.xs(idx, level=0, axis=1) for idx, gp in df.groupby(level=0, axis=1)}
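Recent pandas releases deprecate groupby with axis=1, so the line above may warn or fail depending on the installed version. A minimal sketch of an alternative that builds the same kind of dict by indexing the top column level directly, on a made-up frame (values and variable names are illustrative):

```python
import pandas as pd

dates = pd.DatetimeIndex(['2020-06-25', '2020-06-26'], name='Date')

# Stand-in for yf.download(tickers, group_by="ticker") (made-up values).
df = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]],
    index=dates,
    columns=pd.MultiIndex.from_product([['AAPL', 'MSFT'], ['Open', 'Close']]),
)

# Unique ticker symbols on the top column level, in order of appearance.
tickers = df.columns.get_level_values(0).unique()

# df[ticker] returns that ticker's sub-frame with single-level columns.
d = {ticker: df[ticker] for ticker in tickers}

aapl_mean_close = d['AAPL'].Close.mean()
```

Each value in d is a regular single-level dataframe, so d['AAPL'].Close behaves like the access pattern in the question.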
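Use the lines below to write and read the CSV file. The data comes back in exactly the same format as what you download from the yfinance API.
Write to file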
data.to_csv('file_loc')
Read from file
data = pd.read_csv('file_loc', header=[0, 1], index_col=[0])
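The difference between a plain read_csv and the header=[0, 1] form can be checked without touching yfinance. The sketch below round-trips a small made-up frame through an in-memory CSV; the values and the names original, flat and multi are illustrative, and parse_dates=True is an optional extra that is not part of the line above.

```python
import io

import pandas as pd

dates = pd.DatetimeIndex(['2020-06-25', '2020-06-26'], name='Date')

# Made-up frame shaped like a group_by='Ticker' download.
original = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]],
    index=dates,
    columns=pd.MultiIndex.from_product([['AAPL', 'MSFT'], ['Open', 'Close']]),
)
csv_text = original.to_csv()

# Plain read: only the first header row is used, so the repeated ticker
# names are de-duplicated into 'AAPL', 'AAPL.1', ... and flat['AAPL'] is a Series.
flat = pd.read_csv(io.StringIO(csv_text))

# Two header rows plus an index column rebuild the multi-level columns.
multi = pd.read_csv(io.StringIO(csv_text), header=[0, 1], index_col=0, parse_dates=True)

aapl_mean_close = multi['AAPL'].Close.mean()
```

The flat result reproduces the situation behind the AttributeError in the question, while multi supports data['AAPL'].Close again.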