根据 match/mismatch 的排序算法

Question

我正在处理的数据代表了 5 个栖息地中某些物种的存在或不存在。我想根据它们之间的共享区域获得簇，基本上我想最大化每个物种的元素之间的匹配。
这是原始数据集

这就是我正在寻找的排序类型
我通过简单地对占用区域的数量进行排序，然后手动修复明显的错误，设法获得了 3 个组。
这种非常简化的算法之所以有效，是因为占用区域总是连续的，并且存在一些常见的模式。

起初我认为这个问题有点类似于Sequence Alignment，但我实在想不出有什么方法可以将这些算法应用到我的案例中。此外，我还想找到一种自动将这些数据分为 3 组的方法，但这并不是绝对必要的。

真正的挑战是对包含间隙的 new Dataset 进行排序。
我认为最好的方法应该是：

计算每对可能的物种之间匹配和不匹配的数量
找到与第一个最匹配的物种并将它们放在一起
重复第二个物种（刚刚重新定位的那个）排除第一个物种的检查（以避免无限循环），依此类推...

## Original df
    Species Zones array
0   A   [0, 1, 1, 1, 1]
1   B   [0, 1, 1, 1, 1]
2   C   [0, 1, 1, 1, 1]
3   D   [0, 1, 1, 0, 0]
4   E   [0, 1, 1, 1, 1]
5   F   [0, 1, 1, 1, 1]
6   G   [0, 1, 1, 1, 1]
7   H   [0, 1, 1, 1, 1]
8   I   [1, 1, 1, 1, 1]
9   J   [0, 1, 1, 1, 1]
10  K   [1, 1, 1, 0, 0]
11  L   [1, 1, 1, 0, 0]
12  M   [1, 1, 1, 1, 1]
13  N   [0, 1, 1, 1, 1]
14  O   [0, 0, 1, 1, 1]
15  P   [0, 1, 1, 1, 1]
16  Q   [0, 0, 1, 1, 1]
17  R   [1, 1, 1, 0, 0]
18  S   [0, 1, 1, 1, 1]
19  T   [0, 1, 1, 1, 0]
20  U   [0, 1, 1, 1, 1]


## Sorted df
  Species   Zones array
0   A   [1, 1, 1, 1, 1]
1   B   [1, 1, 1, 1, 1]
2   C   [0, 1, 1, 1, 1]
3   D   [0, 1, 1, 1, 1]
4   E   [0, 1, 1, 1, 1]
5   F   [0, 1, 1, 1, 1]
6   G   [0, 1, 1, 1, 1]
7   H   [0, 1, 1, 1, 1]
8   I   [0, 1, 1, 1, 1]
9   J   [0, 1, 1, 1, 1]
10  K   [0, 1, 1, 1, 1]
11  L   [0, 1, 1, 1, 1]
12  M   [0, 1, 1, 1, 1]
13  N   [0, 1, 1, 1, 1]
14  O   [0, 0, 1, 1, 1]
15  P   [0, 0, 1, 1, 1]
16  Q   [0, 1, 1, 1, 0]
17  R   [0, 1, 1, 0, 0]
18  S   [1, 1, 1, 0, 0]
19  T   [1, 1, 1, 0, 0]
20  U   [1, 1, 1, 0, 0]

## New gapped df
    Species Prey
0   A   [1, 1, 1, 0, 1, 0, 1, 1, 1]
1   B   [1, 0, 1, 0, 1, 1, 1, 0, 1]
2   C   [1, 1, 1, 0, 1, 0, 1, 0, 1]
3   D   [1, 1, 1, 0, 1, 0, 1, 0, 0]
4   E   [1, 1, 1, 0, 1, 0, 0, 0, 0]
5   F   [1, 0, 1, 0, 1, 1, 0, 0, 0]
6   G   [1, 1, 1, 1, 0, 0, 0, 0, 0]
7   H   [1, 1, 1, 0, 0, 0, 0, 0, 0]
8   I   [1, 0, 1, 0, 0, 0, 0, 0, 0]
9   J   [0, 0, 1, 0, 0, 1, 0, 0, 0]
10  K   [1, 0, 1, 0, 0, 0, 0, 0, 0]
11  L   [1, 0, 1, 0, 0, 0, 0, 0, 0]
12  M   [1, 0, 0, 0, 0, 0, 0, 0, 0]
13  N   [0, 0, 0, 0, 0, 1, 0, 0, 0]
14  O   [1, 0, 0, 0, 0, 0, 0, 0, 0]
15  P   [1, 0, 0, 0, 0, 0, 0, 0, 0]
16  Q   [0, 0, 0, 0, 0, 1, 0, 0, 0]
17  R   [1, 0, 0, 0, 0, 0, 0, 0, 0]
18  S   [0, 0, 1, 0, 0, 0, 0, 0, 0]
19  T   [0, 0, 1, 0, 0, 0, 0, 0, 0]
20  U   [0, 0, 1, 0, 0, 0, 0, 0, 0]

Answer 1

聚类的方法有很多，除非您有特殊情况，否则最好使用现成的方法。这是一个使用 kmeans 聚类的示例，为您的用例设置 k=3。该解决方案使用了很多您可能需要安装的软件包。 sklearn 用于 kmeans 聚类，然后其余导入用于绘图。

import pandas as pd
import sklearn.cluster

from matplotlib.colors import LinearSegmentedColormap
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame({ 
    'A':[0, 1, 1, 1, 1], 
    'B':[0, 1, 1, 1, 1], 
    'C':[0, 1, 1, 1, 1], 
    'D':[0, 1, 1, 0, 0], 
    'E':[0, 1, 1, 1, 1], 
    'F':[0, 1, 1, 1, 1], 
    'G':[0, 1, 1, 1, 1], 
    'H':[0, 1, 1, 1, 1], 
    'I':[1, 1, 1, 1, 1], 
    'J':[0, 1, 1, 1, 1], 
    'K':[1, 1, 1, 0, 0], 
    'L':[1, 1, 1, 0, 0], 
    'M':[1, 1, 1, 1, 1], 
    'N':[0, 1, 1, 1, 1], 
    'O':[0, 0, 1, 1, 1], 
    'P':[0, 1, 1, 1, 1], 
    'Q':[0, 0, 1, 1, 1], 
    'R':[1, 1, 1, 0, 0], 
    'S':[0, 1, 1, 1, 1], 
    'T':[0, 1, 1, 1, 0], 
    'U':[0, 1, 1, 1, 1], 
}).T #transposing to be species as rows and zones as columns

#gapped species
df = pd.DataFrame({
    'A':[1, 1, 1, 0, 1, 0, 1, 1, 1], 
    'B':[1, 0, 1, 0, 1, 1, 1, 0, 1], 
    'C':[1, 1, 1, 0, 1, 0, 1, 0, 1], 
    'D':[1, 1, 1, 0, 1, 0, 1, 0, 0], 
    'E':[1, 1, 1, 0, 1, 0, 0, 0, 0], 
    'F':[1, 0, 1, 0, 1, 1, 0, 0, 0], 
    'G':[1, 1, 1, 1, 0, 0, 0, 0, 0], 
    'H':[1, 1, 1, 0, 0, 0, 0, 0, 0], 
    'I':[1, 0, 1, 0, 0, 0, 0, 0, 0], 
    'J':[0, 0, 1, 0, 0, 1, 0, 0, 0], 
    'K':[1, 0, 1, 0, 0, 0, 0, 0, 0], 
    'L':[1, 0, 1, 0, 0, 0, 0, 0, 0], 
    'M':[1, 0, 0, 0, 0, 0, 0, 0, 0], 
    'N':[0, 0, 0, 0, 0, 1, 0, 0, 0], 
    'O':[1, 0, 0, 0, 0, 0, 0, 0, 0], 
    'P':[1, 0, 0, 0, 0, 0, 0, 0, 0], 
    'Q':[0, 0, 0, 0, 0, 1, 0, 0, 0], 
    'R':[1, 0, 0, 0, 0, 0, 0, 0, 0], 
    'S':[0, 0, 1, 0, 0, 0, 0, 0, 0], 
    'T':[0, 0, 1, 0, 0, 0, 0, 0, 0], 
    'U':[0, 0, 1, 0, 0, 0, 0, 0, 0], 
}).T #transposing to be species as rows and zones as columns 

#Cluster by kmeans with k=3
kmeans = sklearn.cluster.KMeans(n_clusters=3).fit(df)
cluster_labels = pd.Series(kmeans.labels_+1, index=df.index)

#Prepare for plotting
plot_data = df.multiply(cluster_labels,axis=0)
sorted_plot_data = plot_data.loc[cluster_labels.sort_values().index]

myColors = (
    (1.0, 1.0, 1.0, 1.0), #0-value color
    (0.4, 0.0, 0.4, 1.0), #cluster 1 color
    (0.0, 0.4, 0.4, 1.0), #cluster 2 color
    (0.4, 0.4, 0.0, 1.0), #cluster 3 color
)
my_cmap = LinearSegmentedColormap.from_list('Custom', myColors, len(myColors))

#Unsorted Heatmap
fig = plt.figure(figsize=(8,6))
ax = sns.heatmap(
    
    data = plot_data.T,
    cmap=my_cmap,
    cbar=False,
)
ax.invert_yaxis()
plt.title('Species in order colored by k-means cluster')
plt.show()
plt.close()

#Sorted Heatmap
fig = plt.figure(figsize=(8,6))
ax = sns.heatmap(
    data = sorted_plot_data.T,
    cmap=my_cmap,
    cbar=False,
)
ax.invert_yaxis()
plt.title('Species sorted and colored by k-means cluster, sorted alphabetically within cluster')
plt.show()
plt.close()

输出图

Answer 2

我想我以一种简单而适当的方式解决了这个问题（不过我相信可能会有更好的算法）。我检查每对布尔向量之间的匹配数：

n=len(diet_bool_df['Prey'])
match_mtx = [[0 for i in range(n)] for j in range(n)] #empty matrix nxn

for i in range(n):
    for j in range(n):
#this gives a vector containing 1 for a match and 0 for a mismatch position-wise
        match_vec = [int(diet_bool_df['Boolean Diet'][i][k]==diet_bool_df['Prey'][j][k]) for k in range(len(diet_bool_df['Prey'][i]))]
#the sum of all 1 gives the number of matches between each couple (i,j)
        match_mtx[i][j] = sum(match_vec)

由于有 9 个猎物，因此最多有 9 个匹配项，这表明它们的饮食相同。这个新矩阵现在实际上可以用 kmeans 聚类，因为数据是各向同性的并且应该满足所有假设。

def plot_match_cluster():
    fig= go.Figure()
    fig.add_trace(go.Heatmap(
        z=match_mtx,
    ))
    fig.show()

kmeans = sklearn.cluster.KMeans(n_clusters=3).fit_predict(match_mtx)

新的聚类数据如下所示
Clusterized Data

注意。有趣的是，我根据猎物的数量决定的初始排序是在这种情况下创建集群的一个很好的指标。不过，这仅适用于 3 个集群。

根据 match/mismatch 的排序算法

Sorting Algorithm according to match/mismatch

python

sorting

cluster-analysis

dataframe

pandas