How to shuffle a 2d binary matrix, preserving marginal distributions
Suppose I have an (n*m) binary matrix df,
something like this:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.binomial(1, .3, size=(6,8)))
    0 1 2 3 4 5 6 7
    ---------------
0 | 0 0 0 0 0 1 1 0
1 | 0 1 0 0 0 0 0 0
2 | 0 0 0 0 1 0 0 0
3 | 0 0 0 0 0 1 0 1
4 | 0 1 1 0 1 0 0 0
5 | 1 0 1 1 1 0 0 1
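I'd like to shuffle the values in the matrix to create a new_df of the same shape,
such that both marginal distributions stay the same, for example: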
    0 1 2 3 4 5 6 7
    ---------------
0 | 0 0 0 0 1 0 0 1
1 | 0 0 0 0 1 0 0 0
2 | 0 0 0 0 0 0 0 1
3 | 0 1 1 0 0 0 0 0
4 | 1 0 0 0 1 1 0 0
5 | 0 1 1 1 0 1 1 0
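In the new matrix, the sum of each row equals the sum of the corresponding row in the original matrix, and likewise each column of the new matrix has the same sum as the corresponding column of the original.
The solution is easy to check: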
# rows have the same marginal distribution
assert(all(df.sum(axis=1) == new_df.sum(axis=1)))
# columns have the same marginal distribution
assert(all(df.sum(axis=0) == new_df.sum(axis=0)))
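As a concrete check, the two example matrices shown above can be typed in by hand and compared. This is only a sketch: `same_marginals` is a hypothetical helper name, not something from pandas or NumPy, and it uses plain NumPy arrays instead of DataFrames.

```python
import numpy as np

# The original example matrix from the question
original = np.array([
    [0, 0, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 1],
    [0, 1, 1, 0, 1, 0, 0, 0],
    [1, 0, 1, 1, 1, 0, 0, 1],
])
# The shuffled example matrix from the question
shuffled = np.array([
    [0, 0, 0, 0, 1, 0, 0, 1],
    [0, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1],
    [0, 1, 1, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 1, 1, 0],
])

def same_marginals(a, b):
    """Hypothetical helper: True if a and b have identical row and column sums."""
    a, b = np.asarray(a), np.asarray(b)
    return (a.shape == b.shape
            and np.array_equal(a.sum(axis=1), b.sum(axis=1))
            and np.array_equal(a.sum(axis=0), b.sum(axis=0)))

print(original.sum(axis=1))                # row sums: [2 1 1 2 3 5]
print(original.sum(axis=0))                # column sums: [1 2 2 1 3 2 1 2]
print(same_marginals(original, shuffled))  # True
```

Both matrices contain 14 ones, arranged differently but with the same row and column totals.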
If n*m is small, I can use a brute-force approach to the shuffle:
def shuffle_2d(df):
    """Shuffles a multidimensional binary array, preserving marginal distributions"""
    # get the row and column labels of every cell where the df is 1
    rowlist = []
    collist = []
    for i_row, row in df.iterrows():
        for i_col, val in row.items():  # Series.iteritems() was removed in pandas 2.0
            if val == 1:
                rowlist.append(i_row)
                collist.append(i_col)
    # create an all-zero df of the same shape
    new_df = pd.DataFrame(index=df.index, columns=df.columns, data=0)
    # shuffle until you get no repeat coordinates
    # this is so you don't increment the same cell in the matrix twice
    repeats = 999
    while repeats > 1:
        pairs = list(zip(np.random.permutation(rowlist), np.random.permutation(collist)))
        # count how often the most common (row, col) pair occurs
        repeats = pd.Series(pairs).value_counts().max()
    # populate new data frame at indicated points
    for i_row, i_col in pairs:
        new_df.at[i_row, i_col] += 1
    return new_df
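The problem is that the brute-force approach scales very poorly. (As in that line from Indiana Jones and the Last Crusade: https://youtu.be/Ubw5N8iVDHI?t=3)
As a quick demonstration, for an n*n matrix, the number of attempts needed to get an acceptable shuffle looks like this (from a single run):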
n attempts
2 1
3 2
4 4
5 1
6 1
7 11
8 9
9 22
10 4416
11 800
12 66
13 234
14 5329
15 26501
16 27555
17 5932
18 668902
...
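For reference, attempt counts like these can be measured with a small helper along the following lines. This is a sketch under my own assumptions, not the code that produced the numbers above: `count_attempts` and its `max_attempts` cap are hypothetical, and it works on a NumPy array instead of a DataFrame.

```python
import numpy as np

def count_attempts(m, rng, max_attempts=100_000):
    """Count brute-force attempts until the permuted (row, col) pairs are all distinct.

    Returns the attempt number of the first acceptable shuffle,
    or None if none is found within max_attempts.
    """
    rows, cols = np.nonzero(m)  # coordinates of the ones
    for attempt in range(1, max_attempts + 1):
        # independently permute the row indices and the column indices
        pairs = set(zip(rng.permutation(rows).tolist(),
                        rng.permutation(cols).tolist()))
        # a repeated pair would put two ones in the same cell
        if len(pairs) == len(rows):
            return attempt
    return None

rng = np.random.default_rng()
for n in range(2, 9):
    m = rng.binomial(1, 0.3, size=(n, n))
    print(n, count_attempts(m, rng))
```

The counts vary a lot from run to run, since both the starting matrix and the permutations are random.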
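Is there a simple solution that preserves the exact marginal distributions (or tells you when no other pattern exists that preserves them)?
As a fallback, I could also use an approximate algorithm that minimizes the sum of squared errors for each row.
Thanks! =)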
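EDIT:
For some reason I didn't find the existing answers before I wrote this question, but after posting they all showed up in the sidebar:
Is it possible to shuffle a 2D matrix while preserving row AND column frequencies?
Randomize matrix in perl, keeping row and column totals the same
Sometimes all you need to do is ask...
Mostly inspired by those existing answers, here is a solution that seems to work: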
def flip1(m):
    """
    Chooses a single (i0, j0) location in the matrix to 'flip'
    Then randomly selects a different (i, j) location that creates
    a quad [(i0, j0), (i0, j), (i, j0), (i, j)] in which flipping every
    element leaves the marginal distributions unaltered.
    Changes those elements, and returns 1.
    If such a quad cannot be completed from the original position,
    does nothing and returns 0.
    """
    i0 = np.random.randint(m.shape[0])
    j0 = np.random.randint(m.shape[1])
    level = m[i0, j0]
    flip = 0 if level == 1 else 1  # the opposite value
    for i in np.random.permutation(range(m.shape[0])):  # try in random order
        if (i != i0 and  # don't swap with self
                m[i, j0] != level):  # maybe swap with a cell that holds opposite value
            for j in np.random.permutation(range(m.shape[1])):
                if (j != j0 and  # don't swap with self
                        m[i, j] == level and  # check that other swaps work
                        m[i0, j] != level):
                    # make the swaps
                    m[i0, j0] = flip
                    m[i0, j] = level
                    m[i, j0] = level
                    m[i, j] = flip
                    return 1
    return 0
def shuffle(m1, n=100):
    m2 = m1.copy()
    f_success = np.mean([flip1(m2) for _ in range(n)])
    # f_success is the fraction of flip attempts that succeed, for diagnostics
    #print(f_success)
    # check the answer
    assert(all(m1.sum(axis=1) == m2.sum(axis=1)))
    assert(all(m1.sum(axis=0) == m2.sum(axis=0)))
    return m2
We can call it like this:
m1 = np.random.binomial(1, .3, size=(6,8))
array([[0, 0, 0, 1, 1, 0, 0, 1],
[1, 0, 0, 0, 0, 0, 1, 0],
[0, 0, 0, 1, 0, 1, 0, 1],
[1, 1, 0, 0, 0, 1, 0, 1],
[0, 0, 0, 0, 0, 1, 0, 0],
[1, 0, 1, 0, 1, 0, 0, 0]])
m2 = shuffle(m1)
array([[0, 0, 0, 0, 1, 1, 0, 1],
[1, 0, 0, 0, 0, 1, 0, 0],
[0, 0, 0, 1, 0, 0, 1, 1],
[1, 1, 1, 0, 1, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 0],
[1, 0, 0, 1, 0, 0, 0, 1]])
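How many iterations do we need to reach the steady-state distribution? I've set a default of 100 here, which is sufficient for these small matrices.
Below I plot the correlation between the original matrix and the shuffled matrix (500 times) for various numbers of iterations.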
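import matplotlib.pyplot as plt

n_iters = 25  # example value; rerun this block for each iteration count to compare
corrs = []
for _ in range(500):
    m1 = np.random.binomial(1, .3, size=(9,9))  # create starting matrix
    m2 = shuffle(m1, n_iters)
    corrs.append(np.corrcoef(m1.flatten(), m2.flatten())[1,0])
plt.hist(corrs, bins=40, alpha=.4, label=str(n_iters))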
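For a 9x9 matrix, we see improvements until about 25 iterations, beyond which we're in the steady state.
For an 18x18 matrix, we see small gains going from 100 to 250 iterations, but not much beyond that.
Note that for larger matrices the correlation between the starting and ending matrices is lower, but it takes longer to get there.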
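You have to look for two rows and two columns whose intersections form a 2x2 submatrix with 1 0 on top and 0 1 on the bottom (or the opposite). You can then switch those values (to 0 1 and 1 0).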
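A minimal sketch of that swap, assuming a NumPy 0/1 integer array with at least two rows and two columns. The names `try_swap` and `has_checkerboard` are hypothetical. The second function relies on Ryser's interchange result as I understand it: any two binary matrices with the same margins are connected by such swaps, so if no swap is possible, no other matrix with these margins exists.

```python
import numpy as np

def try_swap(m, rng):
    """Pick two random rows and two random columns; if the 2x2 submatrix at
    their intersections is a checkerboard, flip it in place and return True."""
    r = rng.choice(m.shape[0], size=2, replace=False)
    c = rng.choice(m.shape[1], size=2, replace=False)
    sub = m[np.ix_(r, c)]  # copy of the 2x2 submatrix
    # checkerboard: each diagonal holds one value, and the two diagonals differ
    if sub[0, 0] == sub[1, 1] and sub[0, 1] == sub[1, 0] and sub[0, 0] != sub[0, 1]:
        m[np.ix_(r, c)] = 1 - sub  # 1 0 / 0 1 becomes 0 1 / 1 0, and vice versa
        return True
    return False

def has_checkerboard(m):
    """Return True if at least one swappable 2x2 submatrix exists."""
    b = np.asarray(m).astype(bool)
    for i in range(b.shape[0]):
        for k in range(i + 1, b.shape[0]):
            # rows i and k admit a swap iff each has a 1 where the other has a 0
            if (b[i] & ~b[k]).any() and (~b[i] & b[k]).any():
                return True
    return False

rng = np.random.default_rng()
m = rng.binomial(1, 0.3, size=(6, 8))
before_rows, before_cols = m.sum(axis=1), m.sum(axis=0)
n_swapped = sum(try_swap(m, rng) for _ in range(1000))
assert (m.sum(axis=1) == before_rows).all()
assert (m.sum(axis=0) == before_cols).all()
```

Many random picks will not land on a checkerboard, so most calls to `try_swap` do nothing; that is the price of keeping each step this simple.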
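The algorithm developed by Verhelst (2008, link to article page) can even sample from all possible matrices with the same margins (it is implemented in the R package RaschSampler).
A newer algorithm by Wang (2020, link) is also available and is more efficient in some cases.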