python向量化:如何提高4层循环的效率
python vectorization: how to increase the efficiency of 4 layer loop
我正在尝试使用 Gibbs 采样实现 LDA,在更新每个主题比例的步骤中,我有一个 4 层循环,它运行起来非常慢,我不确定如何提高这段代码的效率。我现在的代码如下:
N_W是词数,N_D是文档数,Z[i,j]是主题分配(1到K可能的分配),X[i ,j]是第i篇文档中第j个词的个数,Beta[k,:]维度为[K,N_W].
更新如下:
for k in range(K): # iteratively for each topic update
n_k = np.zeros(N_W) # vocab size
for w in range(N_W):
for i in range(N_D):
for j in range(N_W):
# counting number of times a word is assigned to a topic
n_k[w] += (X[i,j] == w) and (Z[i,j] == k)
# update
Beta[k,:] = np.random.dirichlet(gamma + n_k)
您可以使用逻辑函数摆脱最后两个 for 循环:
for k in range(K): # iteratively for each topic update
n_k = np.zeros(N_W) # vocab size
for w in range(N_W):
a = np.logical_not(X-w) # all X(i,j) == w become a True, others a false
b = np.logical_not(Z-k) # all Z(i,j) == w become a True, others a false
c = np.logical_and(a,b) # all (i,j) where X(i,j) == w and Z(i,j) == k are True, others false
n_k[w] = np.sum(c) # sum all True values
甚至作为一个班轮:
n_k = np.array([[np.sum(np.logical_and(np.logical_not(X[:N_D,:N_W]-w), np.logical_not(Z[:N_D,:N_W]-k))) for w in range(N_W)] for k in range(K)])
然后 n_k 中的每一行都可以用于 beta 计算。现在它还包括 N_W 和 N_D 作为限制条件,如果它们不等于 X 和 Z 的大小
我用以下矩阵做了一些测试:
import numpy as np
K = 90
N_W = 100
N_D = 11
N_W = 12
Z = np.random.randint(0, K, size=(N_D, N_W))
X = np.random.randint(0, N_W, size=(N_D, N_W))
gamma = 1
原代码:
%%timeit
Beta = numpy.zeros((K, N_W))
for k in range(K): # iteratively for each topic update
n_k = np.zeros(N_W) # vocab size
for w in range(N_W):
for i in range(N_D):
for j in range(N_W):
# counting number of times a word is assigned to a topic
n_k[w] += (X[i,j] == w) and (Z[i,j] == k)
# update
Beta[k,:] = np.random.dirichlet(gamma + n_k)
865 ms ± 8.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
然后只向量化内部的两个循环:
%%timeit
Beta = numpy.zeros((K, N_W))
for k in range(K): # iteratively for each topic update
n_k = np.zeros(N_W) # vocab size
for w in range(N_W):
n_k[w] = np.sum((X == w) & (Z == k))
# update
Beta[k,:] = np.random.dirichlet(gamma + n_k)
21.6 ms ± 542 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
终于有了广播和提取公共元素的一些创造性应用:
%%timeit
Beta = numpy.zeros((K, N_W))
w = np.arange(N_W)
X_eq_w = np.equal.outer(X, w)
for k in range(K): # iteratively for each topic update
n_k = np.sum(X_eq_w & (Z == k)[:, :, None], axis=(0, 1))
# update
Beta[k,:] = np.random.dirichlet(gamma + n_k)
4.6 ms ± 92.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
这里的权衡是在速度和内存之间。对于我使用的形状,这不是那么占用内存,但我在上一个解决方案中构建的中间三维数组可能会变得非常大。
我正在尝试使用 Gibbs 采样实现 LDA,在更新每个主题比例的步骤中,我有一个 4 层循环,它运行起来非常慢,我不确定如何提高这段代码的效率。我现在的代码如下:
N_W是词数,N_D是文档数,Z[i,j]是主题分配(1到K可能的分配),X[i ,j]是第i篇文档中第j个词的个数,Beta[k,:]维度为[K,N_W].
更新如下:
for k in range(K): # iteratively for each topic update
n_k = np.zeros(N_W) # vocab size
for w in range(N_W):
for i in range(N_D):
for j in range(N_W):
# counting number of times a word is assigned to a topic
n_k[w] += (X[i,j] == w) and (Z[i,j] == k)
# update
Beta[k,:] = np.random.dirichlet(gamma + n_k)
您可以使用逻辑函数摆脱最后两个 for 循环:
for k in range(K): # iteratively for each topic update
n_k = np.zeros(N_W) # vocab size
for w in range(N_W):
a = np.logical_not(X-w) # all X(i,j) == w become a True, others a false
b = np.logical_not(Z-k) # all Z(i,j) == w become a True, others a false
c = np.logical_and(a,b) # all (i,j) where X(i,j) == w and Z(i,j) == k are True, others false
n_k[w] = np.sum(c) # sum all True values
甚至作为一个班轮:
n_k = np.array([[np.sum(np.logical_and(np.logical_not(X[:N_D,:N_W]-w), np.logical_not(Z[:N_D,:N_W]-k))) for w in range(N_W)] for k in range(K)])
然后 n_k 中的每一行都可以用于 beta 计算。现在它还包括 N_W 和 N_D 作为限制条件,如果它们不等于 X 和 Z 的大小
我用以下矩阵做了一些测试:
import numpy as np
K = 90
N_W = 100
N_D = 11
N_W = 12
Z = np.random.randint(0, K, size=(N_D, N_W))
X = np.random.randint(0, N_W, size=(N_D, N_W))
gamma = 1
原代码:
%%timeit
Beta = numpy.zeros((K, N_W))
for k in range(K): # iteratively for each topic update
n_k = np.zeros(N_W) # vocab size
for w in range(N_W):
for i in range(N_D):
for j in range(N_W):
# counting number of times a word is assigned to a topic
n_k[w] += (X[i,j] == w) and (Z[i,j] == k)
# update
Beta[k,:] = np.random.dirichlet(gamma + n_k)
865 ms ± 8.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
然后只向量化内部的两个循环:
%%timeit
Beta = numpy.zeros((K, N_W))
for k in range(K): # iteratively for each topic update
n_k = np.zeros(N_W) # vocab size
for w in range(N_W):
n_k[w] = np.sum((X == w) & (Z == k))
# update
Beta[k,:] = np.random.dirichlet(gamma + n_k)
21.6 ms ± 542 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
终于有了广播和提取公共元素的一些创造性应用:
%%timeit
Beta = numpy.zeros((K, N_W))
w = np.arange(N_W)
X_eq_w = np.equal.outer(X, w)
for k in range(K): # iteratively for each topic update
n_k = np.sum(X_eq_w & (Z == k)[:, :, None], axis=(0, 1))
# update
Beta[k,:] = np.random.dirichlet(gamma + n_k)
4.6 ms ± 92.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
这里的权衡是在速度和内存之间。对于我使用的形状,这不是那么占用内存,但我在上一个解决方案中构建的中间三维数组可能会变得非常大。