Sum over multiindex pandas columns
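I want to create a DataFrame in which both the columns (year, quarter, month) and the index (some attributes) are hierarchical, i.e. MultiIndexes. I want to sum over certain levels, e.g. over all months that belong to one quarter. In pandas this can be done with, for example, the following line:
# Axis 1 = columns, level 0 = year, level 1 = quarter
df.sum(axis=1, level=[0, 1])
This works until, in some odd cases, the index is no longer recognized correctly, which triggers the error message No axis named 1 for object type <class 'pandas.core.series.Series'>.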
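A side note on the call above: as far as I know, the level argument of DataFrame.sum was deprecated in pandas 1.3 and removed in pandas 2.0, so on a recent installation the line fails for a different reason. A minimal sketch of the same quarterly sum via groupby, on a hypothetical toy frame (the names df, cols and quarterly are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 2 rows, columns indexed by (year, quarter, month).
cols = pd.MultiIndex.from_tuples(
    [(2011, (m - 1) // 3 + 1, m) for m in range(1, 7)],
    names=["year", "quarter", "month"],
)
df = pd.DataFrame(np.ones((2, len(cols))), index=["r1", "r2"], columns=cols)

# Transpose so the column levels become row levels, group by the first two
# levels (year, quarter), sum within each group, then transpose back.
quarterly = df.T.groupby(level=["year", "quarter"]).sum().T
print(quarterly)
```

Going through the transpose avoids passing axis=1 to groupby, which newer pandas versions also warn about.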
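In the code below I create two identical DataFrames (MultiIndex on both axes) with a single difference: df1 is created without data, while df2 is created with data right away. The sum works for df2 but not for df1. I don't understand what is happening behind the scenes. Can someone point me to an explanation of this difference?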
import pandas as pd
import numpy as np

# Column tuples (year, quarter, month) and index tuples (a, b, c)
cols = [(y, divmod(m - 1, 3)[0] + 1, m)
        for y in list(range(2011, 2014)) for m in list(range(1, 13))]
inds = [(a, b, c)
        for a in ["a1", "a2"] for b in ["b1", "b2"] for c in ["c1", "c2"]]

# df1 is created without data, df2 is created from an array of ones
df1 = pd.DataFrame(index=pd.MultiIndex.from_tuples(inds, names=["a", "b", "c"]),
                   columns=pd.MultiIndex.from_tuples(cols, names=["year", "quarter", "month"]))
df2 = pd.DataFrame(np.ones(df1.shape),
                   index=pd.MultiIndex.from_tuples(inds, names=["a", "b", "c"]),
                   columns=pd.MultiIndex.from_tuples(cols, names=["year", "quarter", "month"]))

# Fill both frames cell by cell with the same random values
for (col, ind) in [(col, ind) for ind in df1.index.values for col in df1.columns.values]:
    entry = np.random.rand()
    df1.loc[ind, col] = entry
    df2.loc[ind, col] = entry

try:
    df1.sum(axis=1, level=[0, 1])
    print("Sum over df1 did work")
except Exception:
    print("Sum over df1 did not work...")
try:
    df2.sum(axis=1, level=[0, 1])
    print("Sum over df2 did work")
except Exception:
    print("Sum over df2 did not work...")
PS: I found a hint: the entries in df1 are of type float, while those in df2 are np.float64, but that still doesn't help...
The problem is that all columns in df1 have dtype object. Object columns usually hold strings, but here the values are <class 'float'>:
print(df1.dtypes)
year  quarter  month
2011  1        1        object
               2        object
               3        object
      2        4        object
               5        object
               6        object
      3        7        object
               8        object
               9        object
      4        10       object
print(df2.dtypes)
year  quarter  month
2011  1        1        float64
               2        float64
               3        float64
      2        4        float64
               5        float64
               6        float64
      3        7        float64
               8        float64
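The dtype difference can be reproduced on a much smaller frame. The following is only a sketch with made-up names (empty, filled) and a two-level column index; it assumes nothing beyond the DataFrame constructor:

```python
import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_product([[2011], [1, 2]], names=["year", "quarter"])

# Created without data: every column holds NaN with dtype object.
empty = pd.DataFrame(index=["r1", "r2"], columns=cols)
# Created from a float array: every column has dtype float64.
filled = pd.DataFrame(np.ones((2, 2)), index=["r1", "r2"], columns=cols)

# Collect the distinct dtype names of each frame.
empty_dtypes = {str(t) for t in empty.dtypes}
filled_dtypes = {str(t) for t in filled.dtypes}
print(empty_dtypes, filled_dtypes)
```

Assigning floats into the object columns afterwards stores them as plain Python objects, which matches the <class 'float'> entries seen in df1.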
So casting works:
try:
    df1.astype(float).sum(axis=1, level=[0, 1])
    print("Sum over df1 did work")
except:
    print("Sum over df1 did not work...")
try:
    df2.sum(axis=1, level=[0, 1])
    print("Sum over df2 did work")
except:
    print("Sum over df2 did not work...")
Sum over df1 did work
Sum over df2 did work
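As a rough illustration of the cast on a small frame, here is a sketch that fills an object-dtype frame cell by cell and then converts it. It shows astype(float) next to infer_objects(), which is a swapped-in alternative not used in the answer, and it sums via groupby rather than the level argument so that it also runs on newer pandas versions. All names are hypothetical:

```python
import pandas as pd

cols = pd.MultiIndex.from_tuples(
    [(2011, 1, 1), (2011, 1, 2), (2011, 2, 4)],
    names=["year", "quarter", "month"],
)
# Created without data, so the columns start out with dtype object.
df = pd.DataFrame(index=["r1", "r2"], columns=cols)
for row in df.index:
    for col in df.columns:
        df.loc[row, col] = 1.0  # fill cell by cell, as in the question

as_float = df.astype(float)    # explicit cast to float64
inferred = df.infer_objects()  # let pandas infer a better dtype

# Sum the months of each (year, quarter) on the converted frame.
quarterly = as_float.T.groupby(level=["year", "quarter"]).sum().T
print(quarterly)
```

astype(float) raises if a cell cannot be converted, whereas infer_objects() leaves such columns as object, so the explicit cast is the stricter choice.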
for (col, ind) in [(col, ind) for ind in df1.index.values for col in df1.columns.values]:
    entry = np.random.rand()
    df1.loc[ind, col] = entry
    print(type(df1.loc[ind, col]))
    df2.loc[ind, col] = entry
    print(type(df2.loc[ind, col]))
<class 'float'>
<class 'numpy.float64'>
<class 'float'>
<class 'numpy.float64'>
<class 'float'>
<class 'numpy.float64'>
<class 'float'>
<class 'numpy.float64'>
<class 'float'>
<class 'numpy.float64'>
<class 'float'>
<class 'numpy.float64'>
<class 'float'>
<class 'numpy.float64'>
The best approach is to create the DataFrame from a NumPy array; then everything works:
df1 = pd.DataFrame(data=np.random.rand(len(inds), len(cols)),
                   index=pd.MultiIndex.from_tuples(inds, names=["a", "b", "c"]),
                   columns=pd.MultiIndex.from_tuples(cols, names=["year", "quarter", "month"]))
try:
    df1.sum(axis=1, level=[0, 1])
    print("Sum over df1 did work")
except:
    print("Sum over df1 did not work...")
Sum over df1 did work
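If the frame really has to be filled cell by cell, another option (not part of the answer above, only a sketch) is to pass dtype=float when creating the empty frame, so that it starts out as float64 NaN instead of object. The frame and its labels are made up for illustration:

```python
import pandas as pd

cols = pd.MultiIndex.from_tuples(
    [(2011, 1, 1), (2011, 1, 2)], names=["year", "quarter", "month"]
)
# dtype=float makes the empty frame hold float64 NaN instead of object NaN.
df = pd.DataFrame(index=["r1", "r2"], columns=cols, dtype=float)

# Cell-by-cell assignment now writes into float64 columns.
df.loc["r1", (2011, 1, 1)] = 0.25
dtype_names = {str(t) for t in df.dtypes}
print(dtype_names)
```

Cells that are never assigned stay NaN, which sum() skips by default.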