Convert correlation dataframe to dictionary {key = (sample_x,sample_y), value = correlation}
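So I'm trying to do something very similar to these two posts, but with a few differences. First, I don't want a csv file, so no csv module, and I want to do it in Python rather than R.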
Input:
          AF001         AF002     AF003   AF004  AF005
AF001  1.000000  0.000000e+00  0.000000  0.0000      0
AF002  0.374449  1.000000e+00  0.000000  0.0000      0
AF003  0.000347  1.173926e-05  1.000000  0.0000      0
AF004  0.001030  1.494282e-07  0.174526  1.0000      0
AF005  0.001183  1.216664e-06  0.238497  0.7557      1
Output:
{('AF002', 'AF003'): 1.17392596672424e-05, ('AF004', 'AF005'): 0.75570008659397792, ('AF001', 'AF002'): 0.374449352805868, ('AF001', 'AF003'): 0.00034743953114899502, ('AF002', 'AF005'): 1.2166642639889999e-06, ('AF002', 'AF004'): 1.49428208843456e-07, ('AF003', 'AF004'): 0.17452569907144502, ('AF001', 'AF004'): 0.00103026903356954, ('AF003', 'AF005'): 0.238497202355299, ('AF001', 'AF005'): 0.0011830950375467401}
I have a non-redundant correlation matrix, DF_sCorr, that was processed from a redundant matrix using np.tril (code courteously provided by @jezrael).
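I want to collapse it into a dictionary where each key is the sorted tuple of the two samples, i.e. key = tuple(sorted([row_sample, col_sample])), and each value is their correlation.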
Below I wrote an example function, sif_format, that builds a dictionary resembling the SIF format (a 3-column table of the form sample_x interaction_value sample_y), but it takes a long time.
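I think a dictionary is the best way to organize this kind of table. I feel like there is a more efficient way to do this, maybe by working only with booleans? The real dataset I'm using is ~7000x7000.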
I'm not sure whether there is a function within numpy, pandas, scipy, or networkx that can do this kind of processing efficiently.
import pandas as pd
import numpy as np
A_sCorr = np.array([[0.999999999999999, 0.0, 0.0, 0.0, 0.0], [0.374449352805868, 1.0, 0.0, 0.0, 0.0], [0.00034743953114899502, 1.17392596672424e-05, 1.0, 0.0, 0.0], [0.00103026903356954, 1.49428208843456e-07, 0.17452569907144502, 1.0, 0.0], [0.0011830950375467401, 1.2166642639889999e-06, 0.238497202355299, 0.75570008659397792, 1.0]])
sampleLabels = ['AF001', 'AF002', 'AF003', 'AF004', 'AF005']
DF_sCorr = pd.DataFrame(A_sCorr,columns=sampleLabels, index=sampleLabels)
#          AF001         AF002     AF003   AF004  AF005
#AF001  1.000000  0.000000e+00  0.000000  0.0000      0
#AF002  0.374449  1.000000e+00  0.000000  0.0000      0
#AF003  0.000347  1.173926e-05  1.000000  0.0000      0
#AF004  0.001030  1.494282e-07  0.174526  1.0000      0
#AF005  0.001183  1.216664e-06  0.238497  0.7557      1
def sif_format(DF_var):
    D_interaction_corr = {}
    n, m = DF_var.shape
    for i in range(n):
        row_sample = DF_var.index[i]
        for j in range(m):
            col_sample = DF_var.columns[j]
            if row_sample != col_sample:
                D_interaction_corr[tuple(sorted([row_sample, col_sample]))] = DF_var.iloc[i, j]
            if j == i:
                break
    return D_interaction_corr
D_interaction_corr = sif_format(DF_sCorr)
{('AF002', 'AF003'): 1.17392596672424e-05, ('AF004', 'AF005'): 0.75570008659397792, ('AF001', 'AF002'): 0.374449352805868, ('AF001', 'AF003'): 0.00034743953114899502, ('AF002', 'AF005'): 1.2166642639889999e-06, ('AF002', 'AF004'): 1.49428208843456e-07, ('AF003', 'AF004'): 0.17452569907144502, ('AF001', 'AF004'): 0.00103026903356954, ('AF003', 'AF005'): 0.238497202355299, ('AF001', 'AF005'): 0.0011830950375467401}
DataFrame.to_dict() does not work for this; it returns a nested dictionary keyed by column and then by row:
DF_sCorr.to_dict()
{'AF002': {'AF002': 1.0, 'AF003': 1.17392596672424e-05, 'AF001': 0.0, 'AF004': 1.49428208843456e-07, 'AF005': 1.2166642639889999e-06}, 'AF003': {'AF002': 0.0, 'AF003': 1.0, 'AF001': 0.0, 'AF004': 0.17452569907144502, 'AF005': 0.238497202355299}, 'AF001': {'AF002': 0.374449352805868, 'AF003': 0.00034743953114899502, 'AF001': 0.999999999999999, 'AF004': 0.00103026903356954, 'AF005': 0.0011830950375467401}, 'AF004': {'AF002': 0.0, 'AF003': 0.0, 'AF001': 0.0, 'AF004': 1.0, 'AF005': 0.75570008659397792}, 'AF005': {'AF002': 0.0, 'AF003': 0.0, 'AF001': 0.0, 'AF004': 0.0, 'AF005': 1.0}}
In pandas this is unstack; if you call to_dict on the result you get what you want:
In [43]: df.unstack()
Out[43]:
AF001 AF001 1.000000e+00
AF002 3.744490e-01
AF003 3.470000e-04
AF004 1.030000e-03
AF005 1.183000e-03
AF002 AF001 0.000000e+00
AF002 1.000000e+00
AF003 1.173926e-05
AF004 1.494282e-07
AF005 1.216664e-06
AF003 AF001 0.000000e+00
AF002 0.000000e+00
AF003 1.000000e+00
AF004 1.745260e-01
AF005 2.384970e-01
AF004 AF001 0.000000e+00
AF002 0.000000e+00
AF003 0.000000e+00
AF004 1.000000e+00
AF005 7.557000e-01
AF005 AF001 0.000000e+00
AF002 0.000000e+00
AF003 0.000000e+00
AF004 0.000000e+00
AF005 1.000000e+00
dtype: float64
In [45]: df.unstack().to_dict()
Out[45]:
{('AF001', 'AF001'): 1.0,
('AF001', 'AF002'): 0.37444899999999998,
('AF001', 'AF003'): 0.00034700000000000003,
('AF001', 'AF004'): 0.0010300000000000001,
('AF001', 'AF005'): 0.001183,
('AF002', 'AF001'): 0.0,
('AF002', 'AF002'): 1.0,
('AF002', 'AF003'): 1.1739260000000002e-05,
('AF002', 'AF004'): 1.4942820000000002e-07,
('AF002', 'AF005'): 1.216664e-06,
('AF003', 'AF001'): 0.0,
('AF003', 'AF002'): 0.0,
('AF003', 'AF003'): 1.0,
('AF003', 'AF004'): 0.17452599999999999,
('AF003', 'AF005'): 0.23849699999999999,
('AF004', 'AF001'): 0.0,
('AF004', 'AF002'): 0.0,
('AF004', 'AF003'): 0.0,
('AF004', 'AF004'): 1.0,
('AF004', 'AF005'): 0.75570000000000004,
('AF005', 'AF001'): 0.0,
('AF005', 'AF002'): 0.0,
('AF005', 'AF003'): 0.0,
('AF005', 'AF004'): 0.0,
('AF005', 'AF005'): 1.0}
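For reference, here is the same call as a self-contained sketch run against the DF_sCorr frame defined in the question (the df in the session above appears to hold the rounded, printed values, which is my reading rather than something stated). The names flat and D_all are only illustrative. Note that the keys come out as (column label, row label), since unstack puts the column level first.

```python
import numpy as np
import pandas as pd

# Lower-triangular correlation matrix copied from the question.
A_sCorr = np.array([
    [0.999999999999999, 0.0, 0.0, 0.0, 0.0],
    [0.374449352805868, 1.0, 0.0, 0.0, 0.0],
    [0.00034743953114899502, 1.17392596672424e-05, 1.0, 0.0, 0.0],
    [0.00103026903356954, 1.49428208843456e-07, 0.17452569907144502, 1.0, 0.0],
    [0.0011830950375467401, 1.2166642639889999e-06, 0.238497202355299,
     0.75570008659397792, 1.0],
])
sampleLabels = ['AF001', 'AF002', 'AF003', 'AF004', 'AF005']
DF_sCorr = pd.DataFrame(A_sCorr, columns=sampleLabels, index=sampleLabels)

# unstack() pivots the columns into the index, giving a Series whose
# MultiIndex is (column label, row label).
flat = DF_sCorr.unstack()

# to_dict() on that Series yields tuple keys, one per cell of the matrix.
D_all = flat.to_dict()
```

This still contains every cell, including the diagonal and the zeroed upper triangle, so it has n*n entries rather than the n*(n-1)/2 pairs asked for.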
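If you want to do additional operations on the data, such as removing the diagonal or the zero values, call reset_index on the unstacked result and then apply whatever transformation you need: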
u = df.unstack().reset_index()
u[(u['level_0']!=u['level_1']) & (u[0] != 0)]
level_0 level_1 0
1 AF001 AF002 3.744490e-01
2 AF001 AF003 3.470000e-04
3 AF001 AF004 1.030000e-03
4 AF001 AF005 1.183000e-03
7 AF002 AF003 1.173926e-05
8 AF002 AF004 1.494282e-07
9 AF002 AF005 1.216664e-06
13 AF003 AF004 1.745260e-01
14 AF003 AF005 2.384970e-01
19 AF004 AF005 7.557000e-01
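To get from that filtered table to the dictionary in the question, one possible sketch is below. It assumes the DF_sCorr frame from the question; the column names col_sample, row_sample and corr, and the dictionary names, are hypothetical and not part of the original answer. A second route uses np.tril_indices to pick the strict lower triangle directly, which may scale better for a ~7000x7000 matrix because it never materializes the full n*n long table, though I have not timed it.

```python
import numpy as np
import pandas as pd

# Lower-triangular correlation matrix copied from the question.
A_sCorr = np.array([
    [0.999999999999999, 0.0, 0.0, 0.0, 0.0],
    [0.374449352805868, 1.0, 0.0, 0.0, 0.0],
    [0.00034743953114899502, 1.17392596672424e-05, 1.0, 0.0, 0.0],
    [0.00103026903356954, 1.49428208843456e-07, 0.17452569907144502, 1.0, 0.0],
    [0.0011830950375467401, 1.2166642639889999e-06, 0.238497202355299,
     0.75570008659397792, 1.0],
])
sampleLabels = ['AF001', 'AF002', 'AF003', 'AF004', 'AF005']
DF_sCorr = pd.DataFrame(A_sCorr, columns=sampleLabels, index=sampleLabels)

# Route 1: continue from the unstacked, filtered frame.
u = DF_sCorr.unstack().reset_index()
u.columns = ['col_sample', 'row_sample', 'corr']  # hypothetical names
kept = u[(u['col_sample'] != u['row_sample']) & (u['corr'] != 0)]
D_from_unstack = {
    tuple(sorted((c, r))): v
    for c, r, v in zip(kept['col_sample'], kept['row_sample'], kept['corr'])
}

# Route 2: index the strict lower triangle (k=-1 skips the diagonal).
rows, cols = np.tril_indices(len(sampleLabels), k=-1)
labels = np.asarray(sampleLabels)
values = DF_sCorr.to_numpy()[rows, cols]
D_from_tril = {
    tuple(sorted((r, c))): v
    for r, c, v in zip(labels[rows].tolist(), labels[cols].tolist(),
                       values.tolist())
}
```

The two routes agree on this sample, but they are not equivalent in general: the != 0 filter in route 1 also drops any pair whose correlation is exactly zero, while route 2 keeps every pair below the diagonal.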