Efficient way of computing Kullback–Leibler divergence in Python
I have to compute the Kullback-Leibler divergence (KLD) between thousands of discrete probability vectors. Currently I am using the following code, but it's way too slow for my purposes. I was wondering if there is any faster way to compute KL divergence?
import numpy as np
import scipy.stats as sc

# n is the number of data points
kld = np.zeros((n, n))
for i in range(0, n):
    for j in range(0, n):
        if(i != j):
            kld[i, j] = sc.entropy(distributions[i, :], distributions[j, :])
Scipy's stats.entropy in its default usage expects 1D arrays as inputs and gives us a scalar, which is what the code in the question does. Internally, this function also supports broadcasting, which we can exploit here for a vectorized solution.
From the docs:
scipy.stats.entropy(pk, qk=None, base=None)

If only probabilities pk are given, the entropy is calculated as S = -sum(pk * log(pk), axis=0).

If qk is not None, then compute the Kullback-Leibler divergence S = sum(pk * log(pk / qk), axis=0).
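As a quick sanity check of that formula, here is a minimal sketch (the vectors p and q are just illustrative data, not from the question) comparing stats.entropy against the explicit sum for a single pair of unnormalized vectors:

import numpy as np
from scipy import stats

# Two arbitrary unnormalized "count" vectors; stats.entropy normalizes them internally.
p = np.array([1.0, 2.0, 3.0, 4.0])
q = np.array([4.0, 3.0, 2.0, 1.0])

p_norm = p / p.sum()
q_norm = q / q.sum()

# Explicit KL divergence: sum(p * log(p / q)) over the normalized vectors.
kl_manual = np.sum(p_norm * np.log(p_norm / q_norm))

print(np.isclose(kl_manual, stats.entropy(p, q)))  # expected: True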
In our case, we are doing these entropy computations for each row against every other row, performing sum-reductions to get one scalar at each iteration of those two nested loops. Thus, the output array has shape (M, M), where M is the number of rows (data points) in the input array.
Now, the trick here is that stats.entropy() sums along axis=0, so we feed it two versions of distributions, both with the probability-bin dimension pushed to axis=0 to be reduced along, and with the other two axes interleaved as (M, 1) and (1, M), so that broadcasting gives us an output array of shape (M, M).
Thus, a vectorized and much more efficient way to solve our case would be:
from scipy import stats

# kld[i, j] holds KL(distributions[i] || distributions[j]); transposing puts the
# probability bins along axis=0, which is the axis stats.entropy reduces over.
kld = stats.entropy(distributions.T[:,:,None], distributions.T[:,None,:])
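For reference, this is roughly what that broadcasted call computes, written out in plain NumPy (a sketch that assumes every row of distributions is strictly positive; the real stats.entropy also handles zero entries more carefully):

import numpy as np

def pairwise_kld_numpy(distributions):
    # Normalize each row to sum to 1, as stats.entropy does internally.
    p = distributions / distributions.sum(axis=1, keepdims=True)
    # p[:, None, :] is (M, 1, N), p[None, :, :] is (1, M, N); broadcasting yields
    # an (M, M, N) array whose last axis we reduce, so out[i, j] = KL(row i || row j).
    return np.sum(p[:, None, :] * np.log(p[:, None, :] / p[None, :, :]), axis=-1)

Note that the diagonal of this matrix is KL(p || p) = 0, which is also why the np.allclose check below succeeds against the loopy version that skips i == j and simply leaves those entries at zero.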
Runtime tests and verification:
In [15]: def entropy_loopy(distrib):
    ...:     n = distrib.shape[0] #n is the number of data points
    ...:     kld = np.zeros((n, n))
    ...:     for i in range(0, n):
    ...:         for j in range(0, n):
    ...:             if(i != j):
    ...:                 kld[i, j] = stats.entropy(distrib[i, :], distrib[j, :])
    ...:     return kld
    ...:
In [16]: distrib = np.random.randint(0,9,(100,100)) # Setup input
In [17]: out = stats.entropy(distrib.T[:,:,None], distrib.T[:,None,:])
In [18]: np.allclose(entropy_loopy(distrib),out) # Verify
Out[18]: True
In [19]: %timeit entropy_loopy(distrib)
1 loops, best of 3: 800 ms per loop
In [20]: %timeit stats.entropy(distrib.T[:,:,None], distrib.T[:,None,:])
10 loops, best of 3: 104 ms per loop
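One practical caveat: the broadcasted call materializes an intermediate of shape (N, M, M), where N is the length of each distribution, so with thousands of distributions memory can become the bottleneck. A possible workaround (a hedged sketch; the function name and chunk size are illustrative, not from the original answer) is to compute the matrix in row blocks:

import numpy as np
from scipy import stats

def pairwise_kld_chunked(distributions, chunk=256):
    # Compute the (M, M) matrix of pairwise KL divergences in blocks of rows
    # to keep the broadcasted intermediate small.
    M = distributions.shape[0]
    kld = np.empty((M, M))
    dT = distributions.T  # shape (N, M): bins along axis 0, as stats.entropy expects
    for start in range(0, M, chunk):
        stop = min(start + chunk, M)
        # (N, chunk, 1) against (N, 1, M) -> a (chunk, M) block of divergences
        kld[start:stop] = stats.entropy(dT[:, start:stop, None], dT[:, None, :])
    return kld

Each block still uses the same vectorized stats.entropy call, so most of the speedup is preserved while the intermediate stays at shape (N, chunk, M).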