Reshaping a pandas correlation matrix
I have the following correlation matrix, created with pandas via df.corr():
symbol aaa bbb ccc ddd eee
symbol
aaa 1.000000 0.346099 0.131874 -0.150910 0.177589
bbb 0.346099 1.000000 0.177308 -0.384893 0.301150
ccc 0.131874 0.177308 1.000000 -0.176995 0.258812
ddd -0.150910 -0.384893 -0.176995 1.000000 -0.310137
eee 0.177589 0.301150 0.258812 -0.310137 1.000000
From the data frame above, I need to produce a 3-column data frame like this:
aaa aaa 1.000000
aaa bbb 0.346099
aaa ccc 0.131874
aaa ddd -0.150910
aaa eee 0.177589
bbb aaa 0.346099
bbb bbb 1.000000
bbb ccc 0.177308
bbb ddd -0.384893
bbb eee 0.301150
ccc aaa 0.131874
ccc bbb 0.177308
ccc ccc 1.000000
ccc ddd -0.176995
ccc eee 0.258812
ddd aaa -0.150910
ddd bbb -0.384893
ddd ccc -0.176995
ddd ddd 1.000000
ddd eee -0.310137
eee aaa 0.177589
eee bbb 0.301150
eee ccc 0.258812
eee ddd -0.310137
eee eee 1.000000
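As you can see, it is the same data, just presented differently. Each column/row pair in the original data frame simply gets its own row in the new data frame.
Unfortunately, I can't figure out how to do this so that the result is a DataFrame. I tried df.stack(), but the result is a Series. I need a DataFrame so that I can work with the columns. Another problem with df.stack() is that it does not fill in every row; here is a small example of that problem: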
aaa  aaa    1.000000
     bbb    0.346099
     ccc    0.131874
     ddd   -0.150910
     eee    0.177589
bbb  aaa    0.346099
     bbb    1.000000
     ccc    0.177308
     ddd   -0.384893
     eee    0.301150
etc...
You need to add reset_index:
# reset the column and index names
df = df.rename_axis(None).rename_axis(None, axis=1)
# for pandas versions below 0.18.0:
# df.columns.name = None
# df.index.name = None
print(df)
aaa bbb ccc ddd eee
aaa 1.000000 0.346099 0.131874 -0.150910 0.177589
bbb 0.346099 1.000000 0.177308 -0.384893 0.301150
ccc 0.131874 0.177308 1.000000 -0.176995 0.258812
ddd -0.150910 -0.384893 -0.176995 1.000000 -0.310137
eee 0.177589 0.301150 0.258812 -0.310137 1.000000
df1 = df.stack().reset_index()
# set the column names
df1.columns = ['a', 'b', 'c']
print(df1)
a b c
0 aaa aaa 1.000000
1 aaa bbb 0.346099
2 aaa ccc 0.131874
3 aaa ddd -0.150910
4 aaa eee 0.177589
5 bbb aaa 0.346099
6 bbb bbb 1.000000
7 bbb ccc 0.177308
8 bbb ddd -0.384893
9 bbb eee 0.301150
10 ccc aaa 0.131874
11 ccc bbb 0.177308
12 ccc ccc 1.000000
13 ccc ddd -0.176995
14 ccc eee 0.258812
15 ddd aaa -0.150910
16 ddd bbb -0.384893
17 ddd ccc -0.176995
18 ddd ddd 1.000000
19 ddd eee -0.310137
20 eee aaa 0.177589
21 eee bbb 0.301150
22 eee ccc 0.258812
23 eee ddd -0.310137
24 eee eee 1.000000
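For anyone who wants to run this end to end, here is a minimal self-contained sketch. It rebuilds the matrix by hand from the numbers in the question instead of calling df.corr(), and the variable names (corr, flat, long_df) are only illustrative. It clears the axis names by assigning to .name directly, which should behave the same way regardless of pandas version.

```python
import pandas as pd

labels = ["aaa", "bbb", "ccc", "ddd", "eee"]
values = [
    [1.000000, 0.346099, 0.131874, -0.150910, 0.177589],
    [0.346099, 1.000000, 0.177308, -0.384893, 0.301150],
    [0.131874, 0.177308, 1.000000, -0.176995, 0.258812],
    [-0.150910, -0.384893, -0.176995, 1.000000, -0.310137],
    [0.177589, 0.301150, 0.258812, -0.310137, 1.000000],
]

# Stand-in for the result of df.corr(); both axes are named "symbol",
# as in the question.
corr = pd.DataFrame(values, index=labels, columns=labels)
corr.index.name = "symbol"
corr.columns.name = "symbol"

# Clear both axis names on a copy so that reset_index() does not try to
# create two columns with the same name.
flat = corr.copy()
flat.index.name = None
flat.columns.name = None

# stack() gives a Series with a two-level index; reset_index() turns both
# levels into ordinary columns, so every row is fully filled in.
long_df = flat.stack().reset_index()
long_df.columns = ["a", "b", "c"]

print(long_df.head())
```

The "missing" labels in the stacked Series are only a display convention for a MultiIndex; once the levels become columns, each row carries both labels.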
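Use the code below to (a) reshape the correlation matrix, (b) remove duplicate rows (e.g., {aaa, bbb} and {bbb, aaa}), and (c) remove rows that contain the same variable in the first two columns (e.g., {aaa, aaa}):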
# calculate the correlation matrix and reshape
df_corr = df.corr().stack().reset_index()
# rename the columns
df_corr.columns = ['FEATURE_1', 'FEATURE_2', 'CORRELATION']
# create a mask to identify rows with duplicate features as mentioned above
mask_dups = (df_corr[['FEATURE_1', 'FEATURE_2']].apply(frozenset, axis=1).duplicated()) | (df_corr['FEATURE_1']==df_corr['FEATURE_2'])
# apply the mask to clean the correlation dataframe, then renumber the rows from 0
df_corr = df_corr[~mask_dups].reset_index(drop=True)
This produces output like the following:
FEATURE_1 FEATURE_2 CORRELATION
0 aaa bbb 0.346099
1 aaa ccc 0.131874
2 aaa ddd -0.150910
3 aaa eee 0.177589
4 bbb ccc 0.177308
5 bbb ddd -0.384893
6 bbb eee 0.301150
7 ccc ddd -0.176995
8 ccc eee 0.258812
9 ddd eee -0.310137
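A possible alternative to the frozenset mask, offered only as a sketch and not as the method used above: blank out everything except the cells strictly above the diagonal with a NumPy mask before stacking. The helper name upper_triangle_pairs and the 3x3 demo matrix are made up for illustration; the numbers are taken from the question.

```python
import numpy as np
import pandas as pd


def upper_triangle_pairs(corr):
    """Return one row per unordered pair of distinct variables."""
    corr = corr.copy()
    # Clear axis names so reset_index() cannot hit a name collision.
    corr.index.name = None
    corr.columns.name = None

    # True only for cells strictly above the diagonal (k=1 skips the diagonal).
    keep = np.triu(np.ones(corr.shape, dtype=bool), k=1)

    # where() turns every other cell into NaN; the explicit dropna() removes
    # those cells whether or not stack() already dropped them.
    pairs = corr.where(keep).stack().dropna().reset_index()
    pairs.columns = ["FEATURE_1", "FEATURE_2", "CORRELATION"]
    return pairs


labels = ["aaa", "bbb", "ccc"]
demo = pd.DataFrame(
    [
        [1.000000, 0.346099, 0.131874],
        [0.346099, 1.000000, 0.177308],
        [0.131874, 0.177308, 1.000000],
    ],
    index=labels,
    columns=labels,
)

result = upper_triangle_pairs(demo)
print(result)
```

This relies on the matrix being symmetric, which holds for the output of df.corr(), and it avoids building a frozenset for every row.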
One-line solution:
df.corr().stack().rename_axis(('a', 'b')).reset_index(name='value')
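To see what the one-liner does, here is a small runnable sketch on made-up data (the columns x, y, z and their values are assumptions for illustration only). rename_axis names the two index levels of the stacked Series, and reset_index(name=...) names the column that holds the values, so no separate column assignment is needed.

```python
import pandas as pd

# Hypothetical input data; z is an exact decreasing linear function of x.
df = pd.DataFrame(
    {
        "x": [1.0, 2.0, 3.0, 4.0],
        "y": [2.0, 4.0, 6.0, 8.5],
        "z": [4.0, 3.0, 2.0, 1.0],
    }
)

long_df = (
    df.corr()
    .stack()                    # Series indexed by (row label, column label)
    .rename_axis(["a", "b"])    # name the two index levels
    .reset_index(name="value")  # levels become columns; values go to "value"
)

print(long_df)
```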