Convert long-form dataframe of pairwise distances to distance matrix in python
I have a pandas dataframe of pairwise distances in the form of:
SampleA SampleB Num_Differences
0 sample_1 sample_2 1
1 sample_1 sample_3 4
2 sample_2 sample_3 8
Note that there are no self-comparisons (e.g., sample_1 vs sample_1 won't be represented). I'd like to convert this table into a square distance matrix, like so:
          sample_1  sample_2  sample_3
sample_1                   1         4
sample_2         1                   8
sample_3         4         8
Can anyone give me some pointers on how to do such a conversion in python? The problem is analogous to a previous question in R (Converting pairwise distances into a distance matrix in R), but I don't know the corresponding python functions to use. The problem also appears to be the opposite of this question ().
Some code to reproduce a dataframe in the form I'm using:
import pandas as pd

df = pd.DataFrame([['sample_1', 'sample_2', 1],
                   ['sample_1', 'sample_3', 4],
                   ['sample_2', 'sample_3', 8]],
                  columns=['SampleA', 'SampleB', 'Num_Differences'])
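# The column is named 'SampleA' (no underscore); aggfunc is passed by name
pd.pivot_table(df, values='Num_Differences', index='SampleA',
               columns='SampleB', aggfunc='max', fill_value=0)
Note that if there is no more than one instance of the same SampleA, SampleB pair, it doesn't matter which aggfunc you use; sum, max, min, mean, etc. all give the same result. If a pair can appear more than once, you need to decide how pandas should combine the duplicates.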
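To make the aggfunc point concrete, here is a small sketch with a deliberately duplicated pair. The `dup` frame and the `pivot_with` helper are made up for illustration and are not part of the answer above.

```python
import pandas as pd

# Hypothetical data: the (sample_1, sample_2) pair appears twice.
dup = pd.DataFrame(
    [["sample_1", "sample_2", 1],
     ["sample_1", "sample_2", 3],
     ["sample_1", "sample_3", 4],
     ["sample_2", "sample_3", 8]],
    columns=["SampleA", "SampleB", "Num_Differences"],
)

def pivot_with(aggfunc):
    # One row per SampleA value, one column per SampleB value;
    # pairs that never occur are filled with 0.
    return pd.pivot_table(dup, values="Num_Differences", index="SampleA",
                          columns="SampleB", aggfunc=aggfunc, fill_value=0)

by_max = pivot_with("max")  # duplicated pair collapses to 3
by_min = pivot_with("min")  # duplicated pair collapses to 1
by_sum = pivot_with("sum")  # duplicated pair collapses to 4
```

With this data the pivot has rows only for the labels seen in SampleA and columns only for those seen in SampleB, so it is not yet a square, symmetric matrix.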
- Precompute an array of the unique labels present in the original pairwise distances:
idx = pd.concat([df['SampleA'], df['SampleB']]).unique()
idx.sort()
idx
array(['sample_1', 'sample_2', 'sample_3'], dtype=object)
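- Pivot, then reindex both the index and the columns to introduce zero values in the resulting intermediate DataFrame (recent pandas versions require keyword arguments for pivot):
res = (df.pivot(index='SampleA', columns='SampleB', values='Num_Differences')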
.reindex(index=idx, columns=idx)
.fillna(0)
.astype(int))
res
SampleB sample_1 sample_2 sample_3
SampleA
sample_1 0 1 4
sample_2 0 0 8
sample_3 0 0 0
- Add the intermediate DataFrame to its own transpose to produce the symmetric pairwise distance matrix:
res += res.T
res
SampleB sample_1 sample_2 sample_3
SampleA
sample_1 0 1 4
sample_2 1 0 8
sample_3 4 8 0
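The three steps above can be wrapped into one function. This is only a sketch: `long_to_square` is a hypothetical helper name, and it assumes each unordered pair appears exactly once in the long table (if both A/B and B/A rows were present, adding the transpose would double those values).

```python
import pandas as pd

def long_to_square(df, a="SampleA", b="SampleB", value="Num_Differences"):
    """Hypothetical helper: turn a long table of pairs into a symmetric matrix."""
    # Every label that occurs in either column, in sorted order.
    labels = sorted(set(df[a]) | set(df[b]))
    # One triangle of the matrix; pairs that are absent become 0.
    half = (df.pivot(index=a, columns=b, values=value)
              .reindex(index=labels, columns=labels)
              .fillna(0))
    # Add the transpose as a plain array so no label alignment is involved.
    square = half + half.to_numpy().T
    # Drop the 'SampleA'/'SampleB' axis names from the result.
    return square.rename_axis(index=None, columns=None)

df = pd.DataFrame([["sample_1", "sample_2", 1],
                   ["sample_1", "sample_3", 4],
                   ["sample_2", "sample_3", 8]],
                  columns=["SampleA", "SampleB", "Num_Differences"])
square = long_to_square(df)
```

The values come back as floats because the reindexing step introduces NaN before filling; append `.astype(int)` if the distances are known to be integers.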
It seems we're converting a weighted edge list into an adjacency matrix. We can use the networkx functions from_pandas_edgelist and adjacency_matrix to make this conversion:
import networkx as nx
import pandas as pd
# Create Graph
G = nx.from_pandas_edgelist(
df,
source='SampleA',
target='SampleB',
edge_attr='Num_Differences'
)
# Build adjacency matrix
adjacency_df = pd.DataFrame(
nx.adjacency_matrix(G, weight='Num_Differences').todense(),
index=G.nodes,
columns=G.nodes
)
adjacency_df:
sample_1 sample_2 sample_3
sample_1 0 1 4
sample_2 1 0 8
sample_3 4 8 0
If NaN is wanted instead of 0, we can also fill the diagonal with numpy.fill_diagonal:
import networkx as nx
import numpy as np
import pandas as pd
G = nx.from_pandas_edgelist(
df,
source='SampleA',
target='SampleB',
edge_attr='Num_Differences'
)
# Build a dense float array first (a float dtype is needed to hold NaN)
adjacency = nx.adjacency_matrix(G, weight='Num_Differences').toarray().astype(float)
# Overwrite the values on the diagonal (np.NaN was removed in NumPy 2.0; use np.nan)
np.fill_diagonal(adjacency, np.nan)
adjacency_df = pd.DataFrame(adjacency, index=list(G.nodes), columns=list(G.nodes))
adjacency_df:
sample_1 sample_2 sample_3
sample_1 NaN 1.0 4.0
sample_2 1.0 NaN 8.0
sample_3 4.0 8.0 NaN
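If the symmetric matrix already exists as a DataFrame, the same NaN diagonal can be produced without networkx. The sketch below is an alternative, not the answer's method: it uses `DataFrame.mask` with a boolean identity matrix, and the names `square`, `on_diagonal` and `with_nan` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Symmetric matrix with zeros on the diagonal, as in the output above.
labels = ["sample_1", "sample_2", "sample_3"]
square = pd.DataFrame([[0, 1, 4], [1, 0, 8], [4, 8, 0]],
                      index=labels, columns=labels)

# A boolean identity matrix marks the diagonal cells.
on_diagonal = np.eye(len(square), dtype=bool)

# mask() replaces cells where the condition is True with NaN;
# casting to float first gives a dtype that can hold NaN.
with_nan = square.astype(float).mask(on_diagonal)
```

This returns a new frame and leaves `square` unchanged, which avoids writing into the array behind a DataFrame.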
You can reshape to a square, then make it symmetric by adding the transposed values:
# make unique, sorted, common index
idx = sorted(set(df['SampleA']).union(df['SampleB']))
# reshape
(df.pivot(index='SampleA', columns='SampleB', values='Num_Differences')
.reindex(index=idx, columns=idx)
.fillna(0).astype(int)  # the downcast= argument of fillna is deprecated in recent pandas
.pipe(lambda x: x+x.values.T)
)
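Alternatively, you can use ordered categorical indexers and keep the NAs during reshaping with pivot_table. Then add the transposed values to make it symmetric: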
cat = sorted(set(df['SampleA']).union(df['SampleB']))
(df.assign(SampleA=pd.Categorical(df['SampleA'],
categories=cat,
ordered=True),
SampleB=pd.Categorical(df['SampleB'],
categories=cat,
ordered=True),
)
.pivot_table(index='SampleA',
columns='SampleB',
values='Num_Differences',
dropna=False, fill_value=0,
observed=False)  # keep unobserved categories explicitly
.pipe(lambda x: x+x.values.T)
)
Output:
SampleB sample_1 sample_2 sample_3
SampleA
sample_1 0 1 4
sample_2 1 0 8
sample_3 4 8 0