How to shuffle a 2d binary matrix, preserving marginal distributions

Suppose I have an (n*m) binary matrix df similar to the following:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.binomial(1, .3, size=(6,8)))

    0   1   2   3   4   5   6   7
   ------------------------------
0 | 0   0   0   0   0   1   1   0
1 | 0   1   0   0   0   0   0   0
2 | 0   0   0   0   1   0   0   0
3 | 0   0   0   0   0   1   0   1
4 | 0   1   1   0   1   0   0   0
5 | 1   0   1   1   1   0   0   1

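I'd like to shuffle the values in the matrix to create a new_df of the same shape, such that both marginal distributions are the same, for example: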

    0   1   2   3   4   5   6   7
   ------------------------------
0 | 0   0   0   0   1   0   0   1
1 | 0   0   0   0   1   0   0   0
2 | 0   0   0   0   0   0   0   1
3 | 0   1   1   0   0   0   0   0
4 | 1   0   0   0   1   1   0   0
5 | 0   1   1   1   0   1   1   0

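In the new matrix, the sum of each row equals the sum of the corresponding row in the original matrix, and likewise each column in the new matrix has the same sum as the corresponding column in the original matrix.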

A candidate solution is easy to check:

# rows have the same marginal distribution
assert(all(df.sum(axis=1) == new_df.sum(axis=1)))  

# columns have the same marginal distribution
assert(all(df.sum(axis=0) == new_df.sum(axis=0)))
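
As a concrete instance of this check, here is a minimal sketch that hard-codes the two example matrices shown above and compares their row and column sums. The helper name `same_marginals` is made up for illustration and is not part of the original code.

```python
import numpy as np

def same_marginals(a, b):
    """Return True if a and b have identical shapes, row sums and column sums."""
    a, b = np.asarray(a), np.asarray(b)
    return (a.shape == b.shape
            and np.array_equal(a.sum(axis=1), b.sum(axis=1))
            and np.array_equal(a.sum(axis=0), b.sum(axis=0)))

# The original example matrix from the question
original = np.array([
    [0, 0, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 1],
    [0, 1, 1, 0, 1, 0, 0, 0],
    [1, 0, 1, 1, 1, 0, 0, 1],
])

# The shuffled example matrix from the question
shuffled = np.array([
    [0, 0, 0, 0, 1, 0, 0, 1],
    [0, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1],
    [0, 1, 1, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 1, 1, 0],
])

print(same_marginals(original, shuffled))  # True: both margins match
```

Both matrices have row sums 2, 1, 1, 2, 3, 5 and column sums 1, 2, 2, 1, 3, 2, 1, 2, so the check passes.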

If n*m is small, I can use a brute-force approach to the shuffle:

def shuffle_2d(df):
    """Shuffles a multidimensional binary array, preserving marginal distributions"""
    # get a list of indices where the df is 1
    rowlist = []
    collist = []
    for i_row, row in df.iterrows():
        for i_col, val in row.items():  # Series.iteritems() was removed in pandas 2.0
            if val == 1:
                rowlist.append(i_row)
                collist.append(i_col)

    # create an empty df of the same shape
    new_df = pd.DataFrame(index=df.index, columns=df.columns, data=0)

    # shuffle until you get no repeat coordinates 
    # this is so you don't increment the same cell in the matrix twice
    repeats = 999
    while repeats > 1:
        pairs = list(zip(np.random.permutation(rowlist), np.random.permutation(collist)))
        repeats = pd.Series(pairs).value_counts().max()  # top-level pd.value_counts is deprecated

    # populate new data frame at indicated points
    for i_row, i_col in pairs:
        new_df.at[i_row, i_col] += 1

    return new_df

The problem is that the brute-force approach scales very poorly. (As in that line from Indiana Jones and the Last Crusade: https://youtu.be/Ubw5N8iVDHI?t=3)

As a quick demonstration, for an n*n matrix, the number of attempts needed to get an acceptable shuffle looks like this (from a single run):

n   attempts
2   1
3   2
4   4
5   1
6   1
7   11
8   9
9   22
10  4416
11  800
12  66
13  234
14  5329
15  26501
16  27555
17  5932
18  668902
...
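
For anyone who wants to reproduce a table like this, here is a rough sketch of how the attempts could be counted. It mirrors the permute-and-retry idea of the brute-force function above but works on a plain NumPy array. The function name `count_attempts`, the `max_attempts` cap, and the fill probability of 0.3 are my assumptions, and the numbers will differ from run to run.

```python
import numpy as np

def count_attempts(m, rng, max_attempts=10_000):
    """Count permutation attempts until no (row, col) pair repeats.

    Returns None if max_attempts is reached without success.
    """
    rows, cols = np.nonzero(m)  # coordinates of the 1s
    for attempt in range(1, max_attempts + 1):
        r = rng.permutation(rows)
        c = rng.permutation(cols)
        flat = r * m.shape[1] + c  # encode each (row, col) pair as a single integer
        if np.unique(flat).size == flat.size:  # no repeated coordinates
            return attempt
    return None

rng = np.random.default_rng(0)
for n in range(2, 8):
    m = rng.binomial(1, 0.3, size=(n, n))  # assumed density of 0.3
    print(n, count_attempts(m, rng))
```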

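Is there a straightforward solution that preserves the exact marginal distributions (or tells you when no other pattern exists that preserves them)?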
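
On the parenthetical part: a rough sketch of one way to detect that no other arrangement exists. It relies on the classical interchange result for binary matrices (usually attributed to Ryser): any two 0/1 matrices with the same margins are connected by a sequence of 2x2 "checkerboard" swaps, so if the matrix contains no checkerboard at all, it is the only matrix with those margins. The function name `has_checkerboard` is made up for this example.

```python
import numpy as np
from itertools import combinations

def has_checkerboard(m):
    """Return True if some pair of rows and pair of columns forms
    [[1, 0], [0, 1]] or [[0, 1], [1, 0]]."""
    m = np.asarray(m).astype(bool)
    for i, k in combinations(range(m.shape[0]), 2):
        only_i = m[i] & ~m[k]  # columns where row i has 1 and row k has 0
        only_k = ~m[i] & m[k]  # columns where row k has 1 and row i has 0
        if only_i.any() and only_k.any():
            return True
    return False

print(has_checkerboard([[1, 0], [0, 1]]))  # True: the two 1s can be swapped
print(has_checkerboard([[1, 1], [1, 0]]))  # False: the margins force this matrix
```

This scans every pair of rows, so it is only meant for small to moderate matrices.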

As a fallback, I could also use an approximation algorithm that minimizes the sum of squared errors on each row.

Thanks! =)


EDIT: For some reason I didn't find the existing answers before I wrote this question, but after posting it they all showed up in the sidebar:

Is it possible to shuffle a 2D matrix while preserving row AND column frequencies?

Randomize matrix in perl, keeping row and column totals the same

Sometimes all you need to do is ask...

With thanks for the inspiration, here is a solution that seems to work:


def flip1(m):
    """
    Chooses a single (i0, j0) location in the matrix to 'flip'
    Then randomly selects a different (i, j) location that creates
    a quad [(i0, j0), (i0, j), (i, j0), (i, j)] in which flipping every
    element leaves the marginal distributions unaltered.  
    Changes those elements, and returns 1.

    If such a quad cannot be completed from the original position, 
    does nothing and returns 0.
    """
    i0 = np.random.randint(m.shape[0])
    j0 = np.random.randint(m.shape[1])

    level = m[i0, j0]
    flip = 0 if level == 1 else 1  # the opposite value

    for i in np.random.permutation(range(m.shape[0])):  # try in random order
        if (i != i0 and  # don't swap with self
            m[i, j0] != level):  # maybe swap with a cell that holds opposite value
            for j in np.random.permutation(range(m.shape[1])):
                if (j != j0 and  # don't swap with self
                    m[i, j] == level and  # check that other swaps work
                    m[i0, j] != level):
                    # make the swaps
                    m[i0, j0] = flip
                    m[i0, j] = level
                    m[i, j0] = level
                    m[i, j] = flip
                    return 1

    return 0

def shuffle(m1, n=100):
    m2 = m1.copy()
    f_success = np.mean([flip1(m2) for _ in range(n)])

    # f_success is the fraction of flip attempts that succeed, for diagnostics
    #print(f_success)

    # check the answer
    assert(all(m1.sum(axis=1) == m2.sum(axis=1)))
    assert(all(m1.sum(axis=0) == m2.sum(axis=0)))

    return m2

We can call it like this:

m1 = np.random.binomial(1, .3, size=(6,8))
array([[0, 0, 0, 1, 1, 0, 0, 1],
       [1, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 1, 0, 1, 0, 1],
       [1, 1, 0, 0, 0, 1, 0, 1],
       [0, 0, 0, 0, 0, 1, 0, 0],
       [1, 0, 1, 0, 1, 0, 0, 0]])
m2 = shuffle(m1)
array([[0, 0, 0, 0, 1, 1, 0, 1],
       [1, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 1, 0, 0, 1, 1],
       [1, 1, 1, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0],
       [1, 0, 0, 1, 0, 0, 0, 1]])

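How many iterations do we need to reach the steady-state distribution? I have set a default of 100 here, which is sufficient for these small matrices.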

Below I plot the correlation between the original matrix and the shuffled matrix (over 500 trials) for various numbers of iterations.

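import matplotlib.pyplot as plt

for n_iters in (5, 25, 100):  # example iteration counts; the values actually plotted are not shown
    corrs = []
    for _ in range(500):
        m1 = np.random.binomial(1, .3, size=(9,9)) # create starting matrix
        m2 = shuffle(m1, n_iters)
        corrs.append(np.corrcoef(m1.flatten(), m2.flatten())[1,0])
    plt.hist(corrs, bins=40, alpha=.4, label=n_iters)
plt.legend()
plt.show()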

For a 9x9 matrix, we see improvements up to about 25 iterations, beyond which we are at steady state.

For an 18x18 matrix, we see small gains going from 100 to 250 iterations, but not much beyond that.

Note that for larger matrices the correlation between the starting and ending matrices ends up lower, but it takes longer to get there.

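You have to look for two rows and two columns whose intersections give a 2x2 matrix with 1 0 on top and 0 1 on the bottom (or the reverse). You can then switch these values (to 0 1 and 1 0).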
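
A minimal sketch of a single swap trial along these lines, assuming a NumPy integer array with at least two rows and two columns. The function name `checkerboard_swap` and the use of `np.random.default_rng` are my choices for illustration. How many trials are needed, and whether the resulting matrices are sampled uniformly, is not addressed here.

```python
import numpy as np

def checkerboard_swap(m, rng):
    """Pick two random rows and two random columns; if their 2x2
    intersection is a checkerboard, swap it in place and return True."""
    # Requires at least 2 rows and 2 columns (sampling without replacement)
    i, k = rng.choice(m.shape[0], size=2, replace=False)
    j, l = rng.choice(m.shape[1], size=2, replace=False)
    # Checkerboard: diagonal cells equal, off-diagonal cells equal, and the two differ
    if m[i, j] == m[k, l] and m[i, l] == m[k, j] and m[i, j] != m[i, l]:
        m[i, j], m[i, l] = m[i, l], m[i, j]
        m[k, j], m[k, l] = m[k, l], m[k, j]
        return True
    return False

rng = np.random.default_rng(0)
m = rng.binomial(1, 0.3, size=(6, 8))
shuffled = m.copy()
n_changed = sum(checkerboard_swap(shuffled, rng) for _ in range(1000))

# Row and column sums are untouched by every successful swap
assert np.array_equal(m.sum(axis=1), shuffled.sum(axis=1))
assert np.array_equal(m.sum(axis=0), shuffled.sum(axis=0))
print(n_changed, "of 1000 trials changed the matrix")
```

Each successful swap moves one 1 within each of the two rows and within each of the two columns, which is why both margins stay fixed.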

The algorithm developed by Verhelst (2008, link to article page) can even sample from all possible matrices with the same marginals (implemented in the R package RaschSampler).

A more recent algorithm by Wang (2020, link) is also available and is more efficient in some cases.