Sort a MultiIndex DataFrame using a reference list
Given a multi-index df like the following:
mylevelA caseA VAR_A mylevelA_caseA__VAR_A bar one -0.054973 -0.092080
caseC VAR_B mylevelA_caseC__VAR_B bar two -0.282347 0.882559
VAR_A mylevelA_caseC__VAR_A baz one -0.691023 0.879495
caseB VAR_B mylevelA_caseB__VAR_B baz two -0.321049 1.036407
caseA VAR_C mylevelA_caseA__VAR_C foo one -0.411117 0.523282
caseB VAR_C_D mylevelA_caseB__VAR_C_D foo two -0.998682 0.232587
caseC VAR_E mylevelA_caseC__VAR_E qux one 0.690079 0.985688
caseD VAR_A mylevelA_caseD__VAR_A qux two -2.151700 0.554983
I want to sort level 1 according to the list
order_list=['caseC','caseB','caseD','caseA']
which should produce the following result:
col1 col2
mylevelA
caseC VAR_A mylevelA_caseC__VAR_A baz one 1.135174 -0.547376
VAR_E mylevelA_caseC__VAR_E qux one 0.021435 -0.047488
VAR_B mylevelA_caseC__VAR_B bar two -0.892378 2.649619
caseB VAR_C_D mylevelA_caseB__VAR_C_D foo two 1.945302 -1.848938
VAR_B mylevelA_caseB__VAR_B baz two -2.552820 1.025900
caseD VAR_A mylevelA_caseD__VAR_A qux two -0.833289 -1.478944
caseA VAR_C mylevelA_caseA__VAR_C foo one 1.269452 0.956567
I have the impression this can be solved using sort_values and sort_index:
df=df.sort_values(df.columns.tolist()).sort_index(level=1, ascending=False,
sort_remaining=False)
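However, sort_index only has the ascending parameter, which cannot express a custom order.
Also, using the expression above, I get the output shown after the following code: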
import pandas as pd
import numpy as np
import re
from itertools import chain
arrays = [["mylevelA_caseA__VAR_A", "mylevelA_caseC__VAR_B", "mylevelA_caseC__VAR_A",
"mylevelA_caseB__VAR_B", "mylevelA_caseA__VAR_C", "mylevelA_caseB__VAR_C_D",
"mylevelA_caseC__VAR_E", "mylevelA_caseD__VAR_A"],
["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
["one", "two", "one", "two", "one", "two", "one", "two"]]
df = pd.DataFrame(np.random.randn(8, 2), index=arrays,columns=['col1','col2'])
df.index = pd.MultiIndex.from_tuples([re.split('__?',e[0], maxsplit=2)+list(e)
for e in df.index])
df=df.sort_values(df.columns.tolist()).sort_index(level=1, ascending=False,
sort_remaining=False)
Output:
col1 col2
mylevelA caseD VAR_A mylevelA_caseD__VAR_A qux two 1.240834 -0.097545
caseC VAR_B mylevelA_caseC__VAR_B bar two -0.293481 1.342649
VAR_E mylevelA_caseC__VAR_E qux one -0.581308 -1.370208
VAR_A mylevelA_caseC__VAR_A baz one -1.179519 1.006746
caseB VAR_C_D mylevelA_caseB__VAR_C_D foo two 0.430511 0.447371
VAR_B mylevelA_caseB__VAR_B baz two -0.355763 -1.794507
caseA VAR_A mylevelA_caseA__VAR_A bar one 0.747331 -0.476303
VAR_C mylevelA_caseA__VAR_C foo one -0.702220 0.237277
My question is: how can we sort the MultiIndex using the given order_list?
Instead of sort_index, you can use reindex(), as follows:
order_list=['caseC','caseB','caseD','caseA']
df.reindex(level=1, labels=order_list)
Result:
col1 col2
mylevelA caseC VAR_B mylevelA_caseC__VAR_B bar two 1.536922 -1.285441
VAR_A mylevelA_caseC__VAR_A baz one 0.734785 0.845596
VAR_E mylevelA_caseC__VAR_E qux one -0.577822 -0.689958
caseB VAR_B mylevelA_caseB__VAR_B baz two -0.740523 0.345331
VAR_C_D mylevelA_caseB__VAR_C_D foo two 0.534257 -0.120670
caseD VAR_A mylevelA_caseD__VAR_A qux two 1.327925 0.242728
caseA VAR_A mylevelA_caseA__VAR_A bar one 1.530633 -0.190661
VAR_C mylevelA_caseA__VAR_C foo one -0.290205 -0.323746
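For anyone who wants to check the reindex() approach on their own data, here is a minimal self-contained sketch. It uses a smaller stand-in frame with three index levels rather than the question's six, a fixed seed, and the explicit `index=` keyword instead of `labels=`; the names `reordered` and `seen_order` are only for illustration.

```python
import numpy as np
import pandas as pd

# Small stand-in for the question's frame (assumption: three levels are enough
# to show the idea; the values are random but seeded).
index = pd.MultiIndex.from_tuples([
    ("mylevelA", "caseA", "VAR_A"),
    ("mylevelA", "caseC", "VAR_B"),
    ("mylevelA", "caseC", "VAR_A"),
    ("mylevelA", "caseB", "VAR_B"),
    ("mylevelA", "caseD", "VAR_A"),
])
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((5, 2)), index=index, columns=["col1", "col2"])

order_list = ["caseC", "caseB", "caseD", "caseA"]

# Reorder the rows so that level 1 follows order_list.
reordered = df.reindex(index=order_list, level=1)

# Order in which the level-1 labels first appear in the result.
seen_order = list(dict.fromkeys(reordered.index.get_level_values(1)))
print(seen_order)
```

Comparing `seen_order` with `order_list` is a quick way to confirm the rows ended up grouped in the requested order before relying on the result.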
A categorical dtype is another option. This solution works with sort_index. Add this to your code:
cat_type = pd.CategoricalDtype(
categories=["caseC", "caseB", "caseD", "caseA"], ordered=True
)
df.reset_index(inplace=True)
df["level_1"] = df["level_1"].astype(cat_type)
df = (
df.set_index([i for i in df.columns if i.startswith("level_")])
.sort_index(level=1, ascending=True, sort_remaining=False)
)
df.rename_axis(index=df.index.nlevels * [None], inplace=True)
The output will be:
col1 col2
mylevelA caseC VAR_A mylevelA_caseC__VAR_A baz one 0.095391 1.723488
VAR_E mylevelA_caseC__VAR_E qux one -0.505066 0.871808
VAR_B mylevelA_caseC__VAR_B bar two -1.223648 -0.468713
caseB VAR_C_D mylevelA_caseB__VAR_C_D foo two -0.747988 0.794639
VAR_B mylevelA_caseB__VAR_B baz two -0.749597 1.385091
caseD VAR_A mylevelA_caseD__VAR_A qux two -1.071768 0.920789
caseA VAR_A mylevelA_caseA__VAR_A bar one 1.670896 -2.067492
VAR_C mylevelA_caseA__VAR_C foo one 0.437768 0.417799
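A possible alternative that is not part of the answers above: since pandas 1.1, sort_index accepts a `key` callable, so the custom order can be expressed without resetting the index or converting to a categorical. The sketch below assumes pandas 1.1 or newer and a smaller stand-in frame; `rank` and `sorted_df` are illustrative names. It only claims that the level-1 groups come out in the requested order, not any particular order of rows inside a group.

```python
import numpy as np
import pandas as pd

# Small stand-in for the question's frame (assumption: three index levels).
index = pd.MultiIndex.from_tuples([
    ("mylevelA", "caseA", "VAR_A"),
    ("mylevelA", "caseC", "VAR_B"),
    ("mylevelA", "caseC", "VAR_A"),
    ("mylevelA", "caseB", "VAR_B"),
    ("mylevelA", "caseD", "VAR_A"),
])
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((5, 2)), index=index, columns=["col1", "col2"])

order_list = ["caseC", "caseB", "caseD", "caseA"]

# Map each label to its position in order_list; sorting by that position
# reproduces the custom order.
rank = {label: pos for pos, label in enumerate(order_list)}

# With level=1, the key is applied to the level-1 labels only.
sorted_df = df.sort_index(
    level=1,
    key=lambda labels: labels.map(rank),
    sort_remaining=False,
)

print(list(dict.fromkeys(sorted_df.index.get_level_values(1))))
```

Labels missing from `order_list` would map to NaN here, so it is worth checking that the list covers every label in the level before sorting.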