Creating a DataFrame with a multi-level column index from four 2D NumPy arrays
I have four 2D NumPy arrays:
import numpy as np
import pandas as pd
x1 = np.array([[2, 4, 1],
[2, 2, 1],
[1, 3, 3],
[2, 2, 1],
[3, 3, 2]])
x2 = np.array([[1, 2, 2],
[4, 1, 4],
[1, 4, 4],
[3, 3, 2],
[2, 2, 4]])
x3 = np.array([[4, 3, 2],
[4, 3, 2],
[4, 3, 3],
[1, 2, 2],
[1, 4, 3]])
x4 = np.array([[3, 1, 1],
[3, 4, 3],
[2, 2, 1],
[2, 1, 1],
[1, 2, 4]])
I tried to create a DataFrame as follows:
level_1_label = ['location1','location2','location3']
level_2_label = ['x1','x2','x3','x4']
header = pd.MultiIndex.from_product([level_1_label, level_2_label], names=['Location','Variable'])
df = pd.DataFrame(np.concatenate((x1,x2,x3,x4),axis=1), columns=header)
df.index.name = 'Time'
df
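The data in this DataFrame is not in the desired layout.
I want the four columns (x1, x2, x3, x4) under the first level-1 column label (location1) to be built from the first column of each of the four NumPy arrays. The next four columns (x1, x2, x3, x4), i.e. those under the second level-1 label (location2), should be built from the second column of each array, and so on. The number of level-1 labels, i.e. len(level_1_label), equals the number of columns in each of the four 2D NumPy arrays.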
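One option is to reverse the order of the levels when creating the MultiIndex columns (because level_1_label corresponds to the columns within each array, while level_2_label corresponds to the arrays themselves), and then, after building the DataFrame, use swaplevel + sort_index to put the columns in the desired order: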
level_1_label = ['location1','location2','location3']
level_2_label = ['x1','x2','x3','x4']
header = pd.MultiIndex.from_product([level_2_label, level_1_label], names=['Variable','Location'])
df = pd.DataFrame(np.concatenate((x1,x2,x3,x4),axis=1), columns=header).swaplevel(axis=1).sort_index(level=0, axis=1)
df.index.name = 'Time'
Output:
Location location1          location2          location3
Variable        x1 x2 x3 x4        x1 x2 x3 x4        x1 x2 x3 x4
Time
0                2  1  4  3         4  2  3  1         1  2  2  1
1                2  4  4  3         2  1  3  4         1  4  2  3
2                1  1  4  2         3  4  3  2         3  4  3  1
3                2  3  1  2         2  3  2  1         1  2  2  1
4                3  2  1  1         3  2  4  2         2  4  3  4
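To make the role of each step easier to see, here is a minimal sketch of the same swaplevel + sort_index approach on two small arrays. The names a1, a2, wide, swapped and result are hypothetical and exist only for this illustration; the values are chosen so that each number encodes its source array and position.

```python
import numpy as np
import pandas as pd

# Hypothetical 2x3 arrays: the tens digit identifies the array.
a1 = np.array([[11, 12, 13], [14, 15, 16]])
a2 = np.array([[21, 22, 23], [24, 25, 26]])

locations = ['location1', 'location2', 'location3']
variables = ['x1', 'x2']

# Side-by-side concatenation yields all columns of a1, then all columns of a2,
# so the outer level must be the variable and the inner level the location.
header = pd.MultiIndex.from_product([variables, locations],
                                    names=['Variable', 'Location'])
wide = pd.DataFrame(np.concatenate((a1, a2), axis=1), columns=header)

# swaplevel only relabels: the data stays in place, the level order flips.
swapped = wide.swaplevel(axis=1)

# sort_index then reorders the columns so each location's variables are adjacent.
result = swapped.sort_index(level=0, axis=1)

print(list(wide.columns))
print(list(swapped.columns))
print(list(result.columns))
```

Note that sort_index orders the labels lexicographically. That matches the intended order here, but with labels such as location10 it would place location10 before location2, so the result should be checked when there are ten or more locations.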
Another option is to reshape the data in Fortran order before creating the DataFrame:
# reusing your code
level_1_label = ['location1','location2','location3']
level_2_label = ['x1','x2','x3','x4']
header = pd.MultiIndex.from_product([level_1_label, level_2_label], names=['Location','Variable'])
# np.vstack stacks the arrays row-wise; it is equivalent to np.concatenate with axis=0
outcome = np.reshape(np.vstack([x1,x2,x3,x4]), (len(x1), -1), order = 'F')
df = pd.DataFrame(outcome, columns = header)
df.index.name = 'Time'
df
Location location1          location2          location3
Variable        x1 x2 x3 x4        x1 x2 x3 x4        x1 x2 x3 x4
Time
0                2  1  4  3         4  2  3  1         1  2  2  1
1                2  4  4  3         2  1  3  4         1  4  2  3
2                1  1  4  2         3  4  3  2         3  4  3  1
3                2  3  1  2         2  3  2  1         1  2  2  1
4                3  2  1  1         3  2  4  2         2  4  3  4
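The Fortran-order reshape can be hard to picture, so the sketch below repeats it on two small arrays and compares it with an alternative that is not part of the answer above: stacking the arrays along a new last axis with np.stack and reshaping in the default C order. The names a1, a2, fortran and interleaved are hypothetical and used only for illustration.

```python
import numpy as np

# Hypothetical 2x3 arrays: the tens digit identifies the array.
a1 = np.array([[11, 12, 13], [14, 15, 16]])
a2 = np.array([[21, 22, 23], [24, 25, 26]])

# Approach from the answer: stack row-wise to shape (4, 3), then read and
# refill column by column (Fortran order) into shape (2, 6).
fortran = np.reshape(np.vstack([a1, a2]), (len(a1), -1), order='F')

# Alternative sketch: stack along a new last axis to shape
# (rows, locations, variables), then flatten the last two axes in C order.
interleaved = np.stack([a1, a2], axis=-1).reshape(len(a1), -1)

print(fortran)
print(np.array_equal(fortran, interleaved))
```

In both cases each row of the result lists, for every original column in turn, the value from a1 followed by the value from a2, which is the layout that a MultiIndex built with from_product([locations, variables]) expects.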