Convert long-form dataframe of pairwise distances to distance matrix in Python

Question

I have a pandas dataframe of pairwise distances in the form of:

    SampleA   SampleB  Num_Differences
0  sample_1  sample_2                1
1  sample_1  sample_3                4
2  sample_2  sample_3                8

Note that there are no self-comparisons (e.g., sample_1 vs. sample_1 is not represented). I would like to convert this table into a square distance matrix, like so:

            sample_1      sample_2  sample_3
sample_1                       1              4
sample_2         1                            8
sample_3         4             8

Can anyone give me some pointers on how to do this conversion in Python? The problem is similar to a previous question in R (Converting pairwise distances into a distance matrix in R), but I don't know the corresponding Python functions to use. The problem also appears to be the opposite of this question ().

Some code to reproduce the dataframe in the form I'm using:

import pandas as pd

df = pd.DataFrame([['sample_1', 'sample_2', 1],
                   ['sample_1', 'sample_3', 4],
                   ['sample_2', 'sample_3', 8]],
                  columns=['SampleA', 'SampleB', 'Num_Differences'])

Answer 1

 pd.pivot_table(df, values='Num_Differences', index='SampleA',
                columns='SampleB', aggfunc='max', fill_value=0)

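Note that if each (SampleA, SampleB) pair occurs at most once, the choice of aggfunc does not matter; sum, max, min, mean and so on all return the same value. If a pair can occur more than once, you need to decide how pandas should aggregate the duplicates. Also note that this call only produces rows for the labels in SampleA and columns for the labels in SampleB, so on its own the result is neither square nor symmetric.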
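To make the point about duplicates concrete, here is a minimal sketch. The frame `dup` and the helper `pivot_with` are hypothetical and made up for illustration: the pair (sample_1, sample_2) is listed twice with different values, so the aggregation function now changes the result.

```python
import pandas as pd

# Hypothetical data: the pair (sample_1, sample_2) appears twice
dup = pd.DataFrame([['sample_1', 'sample_2', 1],
                    ['sample_1', 'sample_2', 3],
                    ['sample_2', 'sample_3', 8]],
                   columns=['SampleA', 'SampleB', 'Num_Differences'])

def pivot_with(aggfunc):
    # Same kind of pivot_table call as in the answer, varying only aggfunc
    return pd.pivot_table(dup, values='Num_Differences', index='SampleA',
                          columns='SampleB', aggfunc=aggfunc, fill_value=0)

by_max = pivot_with('max')   # keeps the larger of the two duplicates
by_min = pivot_with('min')   # keeps the smaller one
by_sum = pivot_with('sum')   # adds them together

print(by_max.loc['sample_1', 'sample_2'],
      by_min.loc['sample_1', 'sample_2'],
      by_sum.loc['sample_1', 'sample_2'])
```

For a distance matrix, summing duplicates is rarely what you want, so it is probably worth checking for repeated pairs before pivoting.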

Answer 2

First compute the sorted array of unique labels from the original pairwise distances:

idx = pd.concat([df['SampleA'], df['SampleB']]).unique()
idx.sort() 
idx

array(['sample_1', 'sample_2', 'sample_3'], dtype=object)

Pivot, then reindex both the index and the columns to introduce zero values into the resulting intermediate DataFrame:

res = (df.pivot(index='SampleA', columns='SampleB', values='Num_Differences')
         .reindex(index=idx, columns=idx)
         .fillna(0)
         .astype(int))
res

SampleB   sample_1  sample_2  sample_3
SampleA                               
sample_1         0         1         4
sample_2         0         0         8
sample_3         0         0         0

Add the intermediate DataFrame to its own transpose to produce the symmetric pairwise distance matrix:

res += res.T
res

SampleB   sample_1  sample_2  sample_3
SampleA                               
sample_1         0         1         4
sample_2         1         0         8
sample_3         4         8         0
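As a quick sanity check, the steps above can be repeated end to end and the result tested for the two properties a distance matrix should have: symmetry and a zero diagonal. This is only a sketch; the names `upper`, `dist`, `is_symmetric` and `has_zero_diagonal` are introduced here for illustration, and the label list is built with `sorted(set(...))` rather than `unique()` plus `sort()`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([['sample_1', 'sample_2', 1],
                   ['sample_1', 'sample_3', 4],
                   ['sample_2', 'sample_3', 8]],
                  columns=['SampleA', 'SampleB', 'Num_Differences'])

# Sorted list of every label that occurs in either column
idx = sorted(set(df['SampleA']) | set(df['SampleB']))

# Upper-triangular intermediate frame, zeros where no pair was given
upper = (df.pivot(index='SampleA', columns='SampleB', values='Num_Differences')
           .reindex(index=idx, columns=idx)
           .fillna(0)
           .astype(int))

# Adding the transpose mirrors the upper triangle into the lower one
dist = upper + upper.T

values = dist.to_numpy()
is_symmetric = np.array_equal(values, values.T)
has_zero_diagonal = not np.diag(values).any()
print(is_symmetric, has_zero_diagonal)
```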

Answer 3

It looks like we are converting a weighted edge list to an adjacency matrix. We can use the networkx functions from_pandas_edgelist and adjacency_matrix to make this conversion:

import networkx as nx
import pandas as pd

# Create Graph
G = nx.from_pandas_edgelist(
    df,
    source='SampleA',
    target='SampleB',
    edge_attr='Num_Differences'
)

# Build adjacency matrix
adjacency_df = pd.DataFrame(
    nx.adjacency_matrix(G, weight='Num_Differences').todense(),
    index=G.nodes,
    columns=G.nodes
)

adjacency_df:

          sample_1  sample_2  sample_3
sample_1         0         1         4
sample_2         1         0         8
sample_3         4         8         0

If we want NaN instead of 0 on the diagonal, we can fill it with numpy.fill_diagonal:

import networkx as nx
import numpy as np
import pandas as pd


G = nx.from_pandas_edgelist(
    df,
    source='SampleA',
    target='SampleB',
    edge_attr='Num_Differences'
)

# Build the matrix as a float array first (NaN needs a float dtype)
adjacency = nx.adjacency_matrix(G, weight='Num_Differences').toarray().astype(float)
# Overwrite the values on the diagonal
np.fill_diagonal(adjacency, np.nan)

adjacency_df = pd.DataFrame(adjacency, index=G.nodes, columns=G.nodes)

adjacency_df:

          sample_1  sample_2  sample_3
sample_1       NaN       1.0       4.0
sample_2       1.0       NaN       8.0
sample_3       4.0       8.0       NaN

Answer 4

You can reshape to a square, then make it symmetric by adding the transposed values:

# make unique, sorted, common index
idx = sorted(set(df['SampleA']).union(df['SampleB']))

# reshape
(df.pivot(index='SampleA', columns='SampleB', values='Num_Differences')
   .reindex(index=idx, columns=idx)
   .fillna(0)
   .astype(int)  # distances here are integers; `downcast` is deprecated in recent pandas
   .pipe(lambda x: x+x.values.T)
 )

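Alternatively, you can use ordered categorical indexes and keep the NAs during reshaping with pivot_table, then add the transposed values to make it symmetric: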

cat = sorted(set(df['SampleA']).union(df['SampleB']))

(df.assign(SampleA=pd.Categorical(df['SampleA'],
                                  categories=cat,
                                  ordered=True),
           SampleB=pd.Categorical(df['SampleB'],
                                  categories=cat,
                                  ordered=True),
           )
    .pivot_table(index='SampleA',
                 columns='SampleB',
                 values='Num_Differences',
                 dropna=False, fill_value=0,
                 observed=False)  # keep unobserved categories so the result stays square
    .pipe(lambda x: x+x.values.T)
)

Output:

SampleB   sample_1  sample_2  sample_3
SampleA                               
sample_1         0         1         4
sample_2         1         0         8
sample_3         4         8         0
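One caveat with adding the transpose: if the input happens to list a pair in both orders, that distance is counted twice. A possible variation, not taken from the answers above, is to mirror the rows first and pivot once. The helper name `pairwise_to_matrix`, its parameters, and the sample frame `pairs` below are all hypothetical, and the final `astype(int)` assumes integer distances.

```python
import pandas as pd

def pairwise_to_matrix(pairs, a='SampleA', b='SampleB', value='Num_Differences'):
    """Hypothetical helper: symmetric distance matrix from long-form pairs."""
    # Swap the two label columns so every (a, b) row also exists as (b, a)
    mirrored = pairs.rename(columns={a: b, b: a})
    both = pd.concat([pairs, mirrored], ignore_index=True)
    # A pair given in both orders would now be duplicated; keep the first
    both = both.drop_duplicates(subset=[a, b])
    labels = sorted(set(both[a]))
    return (both.pivot(index=a, columns=b, values=value)
                .reindex(index=labels, columns=labels)
                .fillna(0)      # the diagonal has no entries, so fill it with 0
                .astype(int))   # assumption: distances are integers

# Hypothetical input in which (sample_1, sample_2) is listed in both orders
pairs = pd.DataFrame([['sample_1', 'sample_2', 1],
                      ['sample_2', 'sample_1', 1],
                      ['sample_1', 'sample_3', 4],
                      ['sample_2', 'sample_3', 8]],
                     columns=['SampleA', 'SampleB', 'Num_Differences'])

matrix = pairwise_to_matrix(pairs)
print(matrix)
```

Keeping the first duplicate silently hides inconsistent inputs (for example, if the two orders disagree), so a stricter version might raise an error instead.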
