Pandas long to wide with hierarchical column headers
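I have looked high and low but can't find an example that closely resembles what I'm trying to do. In short, I ran multiple trials of an experiment with several variables. The data was collected and stored in a long DataFrame like this (sample generated data):
Input: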
import itertools
import pandas as pd
from numpy.random import rand  # replaces the deprecated scipy.rand alias

trials = ['trial_1', 'trial_2']
ppi = ['5_ppi', '10_ppi']
rd = ['7%', '12%']
factors = [rd, ppi, trials]  # renamed from `iter`, which shadows the built-in

# Build one small frame per (rd, ppi, trial) combination, then concatenate once
frames = []
for params in itertools.product(*factors):
    ext_data = [rand(), rand()]
    load_data = [rand(), rand()]
    frames.append(pd.DataFrame({'ext': ext_data, 'load': load_data, 'trial': params[2], 'ppi': params[1], 'rd': params[0]}))
df = pd.concat(frames, axis=0)
print(df)
Output:
df
ext load trial ppi rd
0 0.287997 0.874457 trial_1 5_ppi 7%
1 0.783776 0.878291 trial_1 5_ppi 7%
0 0.015054 0.886801 trial_2 5_ppi 7%
1 0.243617 0.234560 trial_2 5_ppi 7%
0 0.291621 0.519084 trial_1 10_ppi 7%
1 0.627786 0.072551 trial_1 10_ppi 7%
0 0.349199 0.235718 trial_2 10_ppi 7%
1 0.284535 0.328547 trial_2 10_ppi 7%
0 0.725747 0.688157 trial_1 5_ppi 12%
1 0.656839 0.297645 trial_1 5_ppi 12%
0 0.534276 0.794199 trial_2 5_ppi 12%
1 0.680596 0.381575 trial_2 5_ppi 12%
0 0.494404 0.246841 trial_1 10_ppi 12%
1 0.148489 0.549250 trial_1 10_ppi 12%
0 0.791440 0.372119 trial_2 10_ppi 12%
1 0.078047 0.552541 trial_2 10_ppi 12%
[16 rows x 5 columns]
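(In reality, each trial has hundreds of data points, not the two shown here.)
I want to convert this into a wide DataFrame with hierarchical columns, organized as follows (I couldn't get values/headers to span multiple columns in markdown, so please excuse the screenshot):
I have tried many things, but I've lost my way. I think pivot_table gets me closest to what I want, but it ends up aggregating the values into their mean instead of listing them, as shown below:
Input: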
df.pivot_table(columns=['rd', 'ppi', 'trial'])
Output:
rd 12% 7%
ppi 10_ppi 5_ppi 10_ppi 5_ppi
trial trial_1 trial_2 trial_1 trial_2 trial_1 trial_2 trial_1 trial_2
ext 0.619217 0.661812 0.652555 0.167241 0.340024 0.42324 0.565166 0.436858
load 0.397430 0.102965 0.528162 0.550871 0.206560 0.28524 0.731204 0.303079
[2 rows x 8 columns]
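It is also possible that each group has a different number of data points (i.e. trial 1 may contain 2 data points while trial 2 has 3). I'm at my wits' end. What is the magic command to convert this long DataFrame into a wide DataFrame with multiple hierarchical columns?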
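For what it's worth, the averaging described above is consistent with pivot_table's default aggfunc="mean": with no row index given, all observations in a group land in the same cell and get collapsed. Below is a minimal sketch of that behaviour on made-up toy data (the values and the helper column name "obs" are assumptions for illustration, not part of the question's data). Numbering the observations within each group and using that counter as the index is one way to keep every value.

```python
import pandas as pd

# Hypothetical toy data: two trials with two observations each.
df = pd.DataFrame({
    "ext":   [1.0, 3.0, 10.0, 30.0],
    "trial": ["trial_1", "trial_1", "trial_2", "trial_2"],
})

# No index given: each trial's observations are collapsed by the
# default aggfunc="mean".
averaged = df.pivot_table(columns="trial")
print(averaged)

# Number the observations within each trial ("obs" is an illustrative name)
# and use that counter as the row index, so every cell holds one value.
df["obs"] = df.groupby("trial").cumcount()
listed = df.pivot_table(index="obs", columns="trial", values="ext")
print(listed)
```

Each (obs, trial) cell now holds exactly one value, so the mean of that cell is the value itself.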
This may not be the most efficient answer, but I think it produces the result you want and should work for any number of observations.
import pandas as pd

def reformat(df, column_name):
    # Collect the observations of each (rd, ppi, trial) group into a list
    groups = df.groupby(['rd', 'ppi', 'trial'])[column_name].apply(list)
    temp_1 = groups.reset_index(name='listvalues')
    # Make names for each of your new columns (sized to the longest group)
    n_obs = temp_1['listvalues'].map(len).max()
    col_names = [column_name + str(i) for i in range(n_obs)]
    # Split listvalues into df where every item is its own column
    listvalues = pd.DataFrame(temp_1['listvalues'].to_list(), columns=col_names)
    # Merge listvalues with the temp_1 df and get rid of the extra listvalues column
    out = temp_1.join(listvalues).drop(columns='listvalues')
    return out.pivot_table(columns=['rd', 'ppi', 'trial'])

# Do this twice and concatenate to get the desired dataframe
df2 = reformat(df, 'ext')
df3 = reformat(df, 'load')
finaldf = pd.concat([df2, df3])  # DataFrame.append was removed in pandas 2.0
What finaldf looks like:
Try this:
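# The last three column names, reversed: ['rd', 'ppi', 'trial']
idx_cols = [*df][-1:-4:-1]
# Append a per-group observation counter so the index becomes unique
res = df.set_index(idx_cols + [df.groupby(idx_cols).cumcount()])
# Move ext/load into the index, then move everything but the counter into the columns
res = res.stack().unstack([0, 1, 2, -1])
print(res)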
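To make the three steps easier to follow, here is a sketch of the same set_index / stack / unstack idea on a small made-up frame where trial_2 has one more observation than trial_1. The toy values and the level names "obs" and "type" are assumptions for illustration; the sketch also unstacks by level name rather than by position, which is a stylistic choice and not what the snippet above does.

```python
import pandas as pd

# Hypothetical toy data: trial_1 has two observations, trial_2 has three.
df = pd.DataFrame({
    "ext":   [0.1, 0.2, 0.3, 0.4, 0.5],
    "load":  [1.1, 1.2, 1.3, 1.4, 1.5],
    "trial": ["trial_1", "trial_1", "trial_2", "trial_2", "trial_2"],
    "ppi":   ["5_ppi"] * 5,
    "rd":    ["7%"] * 5,
})

idx_cols = ["rd", "ppi", "trial"]
# Running counter of observations within each (rd, ppi, trial) group.
obs = df.groupby(idx_cols).cumcount().rename("obs")

# Step 1: (rd, ppi, trial, obs) uniquely identifies every row.
indexed = df.set_index(idx_cols + [obs]).rename_axis(columns="type")

# Step 2: stack() moves the ext/load column labels into a fifth index level.
stacked = indexed.stack()

# Step 3: move every level except "obs" into the columns.
wide = stacked.unstack(["rd", "ppi", "trial", "type"])
print(wide)
```

Groups with fewer observations are padded with NaN, so unequal group sizes do not raise an error.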
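I came up with what is probably a more "correct" solution, but it required me to rearrange my original dataset so that the .pivot() method would work properly. Here is how I re-cast the data:
Input: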
import itertools
import pandas as pd
from numpy.random import rand  # replaces the deprecated scipy.rand alias

trials = ['trial_1', 'trial_2']
ppi = ['5_ppi', '10_ppi']
rd = ['7%', '12%']
factors = [rd, ppi, trials]  # renamed from `iter`, which shadows the built-in

frames = []
for params in itertools.product(*factors):
    ext_data = [rand(), rand()]
    load_data = [rand(), rand()]
    # One block of rows per measurement, labelled by the new 'type' column
    frames.append(pd.DataFrame({'data': ext_data, 'type': 'ext', 'trial': params[2], 'ppi': params[1], 'rd': params[0]}))
    frames.append(pd.DataFrame({'data': load_data, 'type': 'load', 'trial': params[2], 'ppi': params[1], 'rd': params[0]}))
df = pd.concat(frames, axis=0)
print(df)
Output:
df
data type trial ppi rd
0 0.959315 ext trial_1 5_ppi 7%
1 0.394340 ext trial_1 5_ppi 7%
0 0.140045 load trial_1 5_ppi 7%
1 0.519967 load trial_1 5_ppi 7%
0 0.483302 ext trial_2 5_ppi 7%
1 0.552380 ext trial_2 5_ppi 7%
0 0.907199 load trial_2 5_ppi 7%
1 0.123719 load trial_2 5_ppi 7%
0 0.190914 ext trial_1 10_ppi 7%
1 0.053163 ext trial_1 10_ppi 7%
0 0.085914 load trial_1 10_ppi 7%
1 0.749197 load trial_1 10_ppi 7%
0 0.112615 ext trial_2 10_ppi 7%
1 0.363111 ext trial_2 10_ppi 7%
0 0.508180 load trial_2 10_ppi 7%
1 0.459821 load trial_2 10_ppi 7%
0 0.346808 ext trial_1 5_ppi 12%
1 0.322950 ext trial_1 5_ppi 12%
0 0.642119 load trial_1 5_ppi 12%
1 0.101987 load trial_1 5_ppi 12%
0 0.488866 ext trial_2 5_ppi 12%
1 0.583071 ext trial_2 5_ppi 12%
0 0.119333 load trial_2 5_ppi 12%
1 0.800356 load trial_2 5_ppi 12%
0 0.733883 ext trial_1 10_ppi 12%
1 0.856037 ext trial_1 10_ppi 12%
0 0.980035 load trial_1 10_ppi 12%
1 0.364698 load trial_1 10_ppi 12%
0 0.697155 ext trial_2 10_ppi 12%
1 0.712375 ext trial_2 10_ppi 12%
0 0.285191 load trial_2 10_ppi 12%
1 0.198097 load trial_2 10_ppi 12%
[32 rows x 5 columns]
Then, with the data fully cast as shown above, I applied .pivot() to it like so:
Input:
final_df = df.pivot(columns=['rd', 'ppi', 'trial', 'type'])
Output:
final_df
data
rd 7% 12%
ppi 5_ppi 10_ppi 5_ppi 10_ppi
trial trial_1 trial_2 trial_1 trial_2 trial_1 trial_2 trial_1 trial_2
type ext load ext load ext load ext load ext load ext load ext load ext load
0 0.953832 0.012920 0.929764 0.069459 0.406110 0.866707 0.372693 0.815767 0.632988 0.310581 0.027626 0.416959 0.742982 0.340738 0.287946 0.294494
1 0.355673 0.663347 0.363117 0.860274 0.619436 0.146213 0.525354 0.038739 0.579613 0.488108 0.734074 0.794760 0.399273 0.517228 0.736619 0.860785
[2 rows x 16 columns]
Presto!
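If regenerating the dataset is not an option, the same fully long layout can probably be reached from the original ext/load frame with melt. The sketch below assumes made-up toy values and uses illustrative names ("obs", id_cols, long_df) that are not from the posts above. It passes an explicit observation counter as the pivot index instead of relying on the frame's existing 0, 1, 0, 1, ... index.

```python
import pandas as pd

# Hypothetical toy data in the question's original layout.
df = pd.DataFrame({
    "ext":   [0.1, 0.2, 0.3, 0.4],
    "load":  [1.1, 1.2, 1.3, 1.4],
    "trial": ["trial_1", "trial_1", "trial_2", "trial_2"],
    "ppi":   ["5_ppi"] * 4,
    "rd":    ["7%"] * 4,
})

id_cols = ["rd", "ppi", "trial"]
# Explicit observation counter within each (rd, ppi, trial) group.
df["obs"] = df.groupby(id_cols).cumcount()

# Fold the ext/load columns into a 'type' label and a single 'data' column.
long_df = df.melt(
    id_vars=id_cols + ["obs"],
    value_vars=["ext", "load"],
    var_name="type",
    value_name="data",
)

# A list for `columns` requires pandas 1.1 or newer.
final_df = long_df.pivot(index="obs", columns=id_cols + ["type"], values="data")
print(final_df)
```

Passing values="data" keeps the column index at four levels (rd, ppi, trial, type) rather than adding a top-level "data" header.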