Extend a multilevel DataFrame using existing index names with reindex in pandas
The objective is to deepen an existing MultiIndex df.
Given a df such as the following:
                                     col1      col2
mylevelA_caseA__VAR_A   bar one -1.012046  0.808332
mylevelA_caseA__VAR_B   bar two -0.558629 -0.358550
mylevelA_caseB__VAR_A   baz one  1.514448 -1.045073
mylevelA_caseB__VAR_B   baz two  1.268511 -1.100705
mylevelB_caseC__VAR_C   foo one -2.108172 -1.694602
mylevelB_caseC__VAR_C_D foo two -0.629493 -0.005071
mylevelB_caseC__VAR_E   qux one  0.596771 -0.964429
mylevelB_caseD__VAR_A   qux two  0.257154 -0.248278
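I would like to extend the multi-level index so that the first level is split into several levels of its own.
Note that, in the first index level, the keyword VAR is preceded by two underscores (__).
To achieve this, I drafted the following code: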
import pandas as pd
import numpy as np

arrays = [["mylevelA_caseA__VAR_A", "mylevelA_caseA__VAR_B", "mylevelA_caseB__VAR_A",
           "mylevelA_caseB__VAR_B", "mylevelB_caseC__VAR_C", "mylevelB_caseC__VAR_C_D",
           "mylevelB_caseC__VAR_E", "mylevelB_caseD__VAR_A"],
          ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
          ["one", "two", "one", "two", "one", "two", "one", "two"]]
df = pd.DataFrame(np.random.randn(8, 2), index=arrays, columns=['col1', 'col2'])
# print(df)
idx_ls = df.index.values.tolist()
new_multiindex = []
for x in idx_ls:
    b = x[0]
    vv = b.split('_')
    c = []
    new_data = []
    mvar = []
    for xx in vv:
        if not c:
            if xx:
                new_data.append(xx)
            else:
                c = 1
        else:
            if xx:
                mvar.append(xx)
    ntuple = (*new_data, "_ ".join(mvar), *x)
    new_multiindex.append(ntuple)
t = 1
df = df.reindex(new_multiindex, copy=True)
print(df)
which produced:
                                                          col1  col2
mylevelA caseA VAR_ A    mylevelA_caseA__VAR_A   bar one   NaN   NaN
               VAR_ B    mylevelA_caseA__VAR_B   bar two   NaN   NaN
         caseB VAR_ A    mylevelA_caseB__VAR_A   baz one   NaN   NaN
               VAR_ B    mylevelA_caseB__VAR_B   baz two   NaN   NaN
mylevelB caseC VAR_ C    mylevelB_caseC__VAR_C   foo one   NaN   NaN
               VAR_ C_ D mylevelB_caseC__VAR_C_D foo two   NaN   NaN
               VAR_ E    mylevelB_caseC__VAR_E   qux one   NaN   NaN
         caseD VAR_ A    mylevelB_caseD__VAR_A   qux two   NaN   NaN
There are two issues.
First: col1 and col2 return NaN.
Second: is there a more compact way to reduce the number of lines in the for loop?
Use a small list comprehension on the index and build a new MultiIndex:
import re
from itertools import chain

df.index = pd.MultiIndex.from_tuples(
    [tuple(chain(re.split('__?', e[0], maxsplit=2), e[1:]))
     for e in df.index]
)
Or a simpler version:
import re

df.index = pd.MultiIndex.from_tuples(
    [re.split('__?', e[0], maxsplit=2) + list(e[1:])
     for e in df.index]
)
Output:
                                    col1      col2
mylevelA caseA VAR_A   bar one -0.327934 -0.071217
               VAR_B   bar two -0.344340  0.969293
         caseB VAR_A   baz one  0.536292 -0.000917
               VAR_B   baz two  0.632327 -0.493869
mylevelB caseC VAR_C   foo one -0.253687  0.543698
               VAR_C_D foo two -0.239579  1.188864
               VAR_E   qux one -1.450289 -0.756109
         caseD VAR_A   qux two  1.213411  1.237863
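On the first issue in the question: reindex does not rename labels. It looks each requested label up in the existing index and fills any row whose label is not found with NaN. None of the new six-element tuples exists in the original three-level index, which is consistent with every row coming back as NaN. Assigning to df.index, as done above, relabels the rows by position and leaves the data alone. A minimal sketch of the difference, using a made-up two-row frame with flat labels (the names small, new_labels, looked_up and relabelled are illustrative, not from the question):

```python
import pandas as pd

# Hypothetical two-row frame, used only to illustrate the difference.
small = pd.DataFrame({"col1": [1.0, 2.0]}, index=["a__x", "b__y"])
new_labels = ["a", "b"]

# reindex searches the old index for "a" and "b"; neither exists,
# so the result keeps the new labels but every value is NaN.
looked_up = small.reindex(new_labels)

# Assigning the index swaps the labels by position; the data is kept.
relabelled = small.copy()
relabelled.index = new_labels

print(looked_up)
print(relabelled)
```

The same reasoning applies to a MultiIndex: build the new labels, then assign them rather than reindexing.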
To include the original long index:
import re

df.index = pd.MultiIndex.from_tuples(
    [re.split('__?', e[0], maxsplit=2) + list(e)
     for e in df.index]
)
Custom split (split on every "_" up to the "__"):
def custom_split(s):
    a, b = s.split('__')
    return a.split('_') + [b]

df.index = pd.MultiIndex.from_tuples(
    [custom_split(e[0]) + list(e)
     for e in df.index]
)
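The two splitting strategies agree on the labels in the question but diverge once there are more than two parts before the double underscore. The sketch below compares them on one label from the question and on one hypothetical deeper label (top_mid_low__VAR_A is invented for illustration); it uses only the standard library:

```python
import re

def custom_split(s):
    # Split once on the double underscore, then split the prefix on "_".
    a, b = s.split('__')
    return a.split('_') + [b]

label = "mylevelA_caseA__VAR_A"   # taken from the question
deeper = "top_mid_low__VAR_A"     # hypothetical label with three prefix parts

regex_label = re.split('__?', label, maxsplit=2)
custom_label = custom_split(label)
regex_deeper = re.split('__?', deeper, maxsplit=2)
custom_deeper = custom_split(deeper)

print(regex_label)    # ['mylevelA', 'caseA', 'VAR_A']
print(custom_label)   # ['mylevelA', 'caseA', 'VAR_A']
print(regex_deeper)   # ['top', 'mid', 'low__VAR_A']
print(custom_deeper)  # ['top', 'mid', 'low', 'VAR_A']
```

With maxsplit=2 the regex stops after two separators regardless of where the double underscore sits, whereas custom_split always cuts at the double underscore first.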
You can also try this:
df.set_index(
    df.index.get_level_values(0).str.split("_", n=3, expand=True), append=True
).droplevel(5).reorder_levels([3, 4, 5, 0, 1, 2])
Output:
                                                            col1      col2
mylevelA caseA VAR_A   mylevelA_caseA__VAR_A   bar one  2.925263  0.065379
               VAR_B   mylevelA_caseA__VAR_B   bar two -1.544370  0.383090
         caseB VAR_A   mylevelA_caseB__VAR_A   baz one -0.260279 -0.264885
               VAR_B   mylevelA_caseB__VAR_B   baz two  0.071172 -0.201748
mylevelB caseC VAR_C   mylevelB_caseC__VAR_C   foo one -0.319578 -0.909871
               VAR_C_D mylevelB_caseC__VAR_C_D foo two -1.058169 -0.465444
               VAR_E   mylevelB_caseC__VAR_E   qux one -0.432982 -1.999376
         caseD VAR_A   mylevelB_caseD__VAR_A   qux two -0.704989 -0.298849
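The droplevel(5) step may look arbitrary, so here is a small sketch of what the split produces on its own. Splitting on a single "_" with n=3 yields four pieces, and the third one is an empty string, because the double underscore counts as two separators. Appended after the three original levels (positions 0 to 2), the four pieces land at positions 3 to 6, which puts the empty one at position 5. The names labels and parts are illustrative:

```python
import pandas as pd

# Two first-level labels from the question, as a plain Index.
labels = pd.Index(["mylevelA_caseA__VAR_A", "mylevelB_caseC__VAR_C_D"])

# expand=True on an Index returns a MultiIndex with one level per piece.
parts = labels.str.split("_", n=3, expand=True)

print(parts.nlevels)   # 4
print(parts.tolist())
# [('mylevelA', 'caseA', '', 'VAR_A'), ('mylevelB', 'caseC', '', 'VAR_C_D')]
```

After the empty level is dropped, reorder_levels([3, 4, 5, 0, 1, 2]) moves the three remaining new levels in front of the original ones.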