Pandas Panel from Dict of Dataframes Returns NaNs
I have a set of DataFrames that I am trying to turn into a Panel.
Here is my code:
# OPEN THE FILES INTO DATAFRAMES
filenames = ['Yahoo_2016-01-17.csv', 'Yahoo_2016-01-18.csv',
'Yahoo_2016-01-19.csv','Yahoo_2016-01-23.csv','Yahoo_2016-01-27.csv',
'Yahoo_2016-02-05.csv', 'Yahoo_2016-02-06.csv', 'Yahoo_2016-02-09.csv',
'Yahoo_2016-02-11.csv', 'Yahoo_2016-02-13.csv', 'Yahoo_2016-02-15.csv',
'Yahoo_2016-02-16.csv', 'Yahoo_2016-02-29.csv']
dates = np.array(['2016-01-17', '2016-01-18', '2016-01-19', '2016-01-23',
'2016-01-27', '2016-02-05', '2016-02-06','2016-02-09',
'2016-02-11', '2016-02-13', '2016-02-15', '2016-02-16',
'2016-02-29']).astype('datetime64[D]')
filepath = '/Users/RickS/Documents/Investing/Stock_files/GENERAL/'
dfs = [pd.read_csv(filepath+f) for f in filenames]
# Panel not working...
panel = pd.Panel(dict([(date, df) for date in dates for df in dfs]))
panel.swapaxes('major','minor')
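However, when I try to read the panel, all the values in each DataFrame have become NaN.
When I look at the DataFrames individually, they all look fine.
Here is one of the csv files that gets imported into a df:
example_csv_file
One thing to note, which may (or may not) matter, is that the dtypes of the individual DataFrames are not exactly the same: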
In [24]: dfs[1].dtypes
Out[24]:
Name                       object
Symbol                     object
Previous_Close             float64
Average_Daily_Volume       int64
Change_&_Percent_Change    object
Earnings/Share             float64
EPS_Estimate_Current_Year  float64
EPS_Estimate_Next_Quarter  float64
EPS_Estimate_Next_Year     float64
52-week_Low                float64
52-week_High               float64
EBITDA                     object
200-day_Moving_Average     float64
P/E_Ratio                  float64
PEG_Ratio                  float64
Short_Ratio                float64
1_yr_Target_Price          float64
52-week_Range              object
Date                       object
dtype: object
What am I doing wrong?
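The reason for the empty, all-NaN panel is that your dates numpy array is currently stored as datetime64 type. Apparently, pandas Panel objects do not handle datetime64 values well as the underlying dictionary keys.
Simply remove the astype call, or even use a list or tuple that holds the dates as string keys. Since the dates are at day granularity, each string key is still unique, which is all the panel needs.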
# Option 1: keep the numpy array but drop the astype() call
dates = np.array(['2016-01-17', '2016-01-18', '2016-01-19', '2016-01-23',
'2016-01-27', '2016-02-05', '2016-02-06','2016-02-09',
'2016-02-11', '2016-02-13', '2016-02-15', '2016-02-16',
'2016-02-29'])
# Option 2: use a plain list of date strings
dates = ['2016-01-17', '2016-01-18', '2016-01-19', '2016-01-23',
'2016-01-27', '2016-02-05', '2016-02-06','2016-02-09',
'2016-02-11', '2016-02-13', '2016-02-15', '2016-02-16',
'2016-02-29']
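However, this brings up another issue. As written, the list comprehension inside dict() will return a panel that contains only the last dataframe, repeated 13 times. The reason is that the comprehension returns every combination of the dfs list and the dates array, so its length is the product of the two: 13 x 13 = 169 pairs (i.e., a cross join or cartesian product). Look at the output of: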
[(date, df) for date in dates for df in dfs]
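Once you apply dict() to the above, later pairs overwrite earlier ones that share a key, so each of the 13 unique dates ends up carrying the value of the last df, i.e. the last pairing generated for that date.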
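To see the overwriting in isolation, here is a minimal sketch that uses short placeholder strings in place of the real dates and DataFrames. The names dates and dfs mirror the question, but the values are made up for illustration. Nothing here depends on pandas: the behaviour comes purely from the nested comprehension and from how dict() treats repeated keys.

```python
# Placeholder stand-ins: three "dates" and three "frames" instead of 13 each.
dates = ['2016-01-17', '2016-01-18', '2016-01-19']
dfs = ['df_0', 'df_1', 'df_2']

# Nested comprehension: every date is paired with every frame (3 x 3 pairs).
pairs = [(date, df) for date in dates for df in dfs]
print(len(pairs))

# dict() keeps only the last value seen for each repeated key,
# so every date ends up mapped to the final frame.
collapsed = dict(pairs)
print(collapsed)
```

With 13 dates and 13 frames the same thing happens at a larger scale: 169 pairs go in, 13 keys come out, and all of them point at the last frame.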
Consider using zip() to iterate over both collections together, one item from each at a time:
dfDict = {}
for f, d in zip(filenames, dates):
    dfDict[d] = pd.read_csv(filepath+f)
panel = pd.Panel(dfDict)
Or, more concisely:
dfs = [pd.read_csv(filepath+f) for f in filenames]
panel = pd.Panel(dict(zip(dates, dfs)))
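For readers on a recent pandas: Panel was deprecated in pandas 0.20 and removed in 0.25, so pd.Panel is no longer available there. The following is only a sketch of one common substitute, not a drop-in replacement: it stacks the per-date frames into a single DataFrame with pd.concat, using the dict keys as the outer index level. The two small in-memory frames and their values are hypothetical stand-ins for the CSV files in the question.

```python
import pandas as pd

# Hypothetical in-memory frames standing in for the CSV files in the question.
dates = ['2016-01-17', '2016-01-18']
dfs = [
    pd.DataFrame({'Symbol': ['AAA', 'BBB'], 'Previous_Close': [10.0, 20.0]}),
    pd.DataFrame({'Symbol': ['AAA', 'BBB'], 'Previous_Close': [11.0, 19.5]}),
]

# One frame per date, paired positionally with zip().
df_by_date = dict(zip(dates, dfs))

# Stack the frames; the dict keys become the outer level of a MultiIndex.
stacked = pd.concat(df_by_date)

# Select a single date's frame back out of the stacked result.
one_day = stacked.loc['2016-01-18']
print(one_day)
```

Selecting with stacked.loc[date] plays roughly the role that panel[date] played with the Panel built above.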