How to calculate the common volume/intersection between two 2D KDE plots in Python?

Question

I have two sets of data points:

import random
import pandas as pd
A = pd.DataFrame({'x':[random.uniform(0, 1) for i in range(0,100)], 'y':[random.uniform(0, 1) for i in range(0,100)]})
B = pd.DataFrame({'x':[random.uniform(0, 1) for i in range(0,100)], 'y':[random.uniform(0, 1) for i in range(0,100)]})

For each of these datasets, I can produce a joint plot like this:

import seaborn as sns
sns.jointplot(x=A["x"], y=A["y"], kind='kde')
sns.jointplot(x=B["x"], y=B["y"], kind='kde')

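Is there a way to calculate the "common area" between these two joint plots?

By common area I mean: if you put one joint plot "inside" the other, what is the total area of the intersection? In other words, if you imagine the two joint plots as mountains and place one mountain inside the other, how much of one falls inside the other?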

EDIT

To make my question clearer:

import matplotlib.pyplot as plt
import numpy as np  # needed for np.mgrid, np.vstack and np.reshape below
import scipy.stats as st

def plot_2d_kde(df):
    # Extract x and y
    x = df['x']
    y = df['y']
    # Define the borders
    deltaX = (max(x) - min(x))/10
    deltaY = (max(y) - min(y))/10
    xmin = min(x) - deltaX
    xmax = max(x) + deltaX
    ymin = min(y) - deltaY
    ymax = max(y) + deltaY

    # Create meshgrid
    xx, yy = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]

    # We will fit a gaussian kernel using the scipy’s gaussian_kde method
    positions = np.vstack([xx.ravel(), yy.ravel()])
    values = np.vstack([x, y])
    kernel = st.gaussian_kde(values)
    f = np.reshape(kernel(positions).T, xx.shape)

    fig = plt.figure(figsize=(13, 7))
    ax = plt.axes(projection='3d')
    surf = ax.plot_surface(xx, yy, f, rstride=1, cstride=1, cmap='coolwarm', edgecolor='none')
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.set_zlabel('PDF')
    ax.set_title('Surface plot of Gaussian 2D KDE')
    fig.colorbar(surf, shrink=0.5, aspect=5) # add color bar indicating the PDF
    ax.view_init(60, 35)

I am interested in finding the intersection/common volume (just the number) of these two KDE plots:

plot_2d_kde(A)
plot_2d_kde(B)

Credit: the code for the KDE plots is from here

Answer 1

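I believe this is what you are looking for. I am basically calculating the volume (the integral) of the intersection (overlay) of the two KDE distributions.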
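In formula terms, and in my own notation rather than anything taken from the question: writing $f_A$ and $f_B$ for the two fitted KDE densities, the quantity being estimated is the integral of their pointwise minimum, and a regular grid with spacing $\Delta x$, $\Delta y$ gives the approximation on the second line.

```latex
\begin{aligned}
V &= \iint \min\bigl(f_A(x, y),\, f_B(x, y)\bigr)\, dx\, dy, \qquad 0 \le V \le 1, \\
V &\approx \sum_{i}\sum_{j} \min\bigl(f_A(x_i, y_j),\, f_B(x_i, y_j)\bigr)\, \Delta x\, \Delta y .
\end{aligned}
```

Since each density integrates to 1 over the whole plane, $V = 1$ would mean the two KDEs coincide and $V$ near 0 would mean they barely overlap.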

import itertools
import random

import numpy as np
import pandas as pd
import scipy.stats

A = pd.DataFrame({'x':[random.uniform(0, 1) for i in range(0,100)], 'y':[random.uniform(0, 1) for i in range(0,100)]})
B = pd.DataFrame({'x':[random.uniform(0, 1) for i in range(0,100)], 'y':[random.uniform(0, 1) for i in range(0,100)]})

# KDE for both A and B
kde_a = scipy.stats.gaussian_kde([A.x, A.y])
kde_b = scipy.stats.gaussian_kde([B.x, B.y])

min_x = min(A.x.min(), B.x.min())
min_y = min(A.y.min(), B.y.min())
max_x = max(A.x.max(), B.x.max())
max_y = max(A.y.max(), B.y.max())

print(f"x is from {min_x} to {max_x}")
print(f"y is from {min_y} to {max_y}")

# Build the grid once; each point stands for a cell of size step x step.
step = 0.01
grid = list(itertools.product(np.arange(min_x, max_x, step), np.arange(min_y, max_y, step)))
x = [p[0] for p in grid]
y = [p[1] for p in grid]

# Sample both KDEs on the grid (roughly 100x100 points for data in [0, 1]).
a_dist = kde_a([x, y])
b_dist = kde_b([x, y])

# Riemann sum: the sum of densities times the cell area approximates the integral.
cell_area = step * step
print(a_dist.sum() * cell_area)   # integral of A
print(b_dist.sum() * cell_area)   # integral of B
print(np.minimum(a_dist, b_dist).sum() * cell_area) # integral of the intersection between A and B
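A note on scaling, since it is easy to get wrong: a plain mean of the sampled densities only approximates the integral when the sampled rectangle has an area of about 1, which happens to hold for data in the unit square. The sketch below uses made-up data on [0, 10] x [0, 10] to show the difference between the plain mean and a Riemann sum; the variable names are mine and only serve the illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical data spread over [0, 10] x [0, 10] (not the data from the question).
rng = np.random.default_rng(0)
data = rng.uniform(0, 10, size=(2, 200))
kde = gaussian_kde(data)

# Regular grid over the bounding box of the data.
step = 0.1
xs = np.arange(data[0].min(), data[0].max(), step)
ys = np.arange(data[1].min(), data[1].max(), step)
xx, yy = np.meshgrid(xs, ys, indexing="ij")
dens = kde(np.vstack([xx.ravel(), yy.ravel()]))

# Plain mean of the density: too small by a factor of roughly the box area.
mean_only = dens.sum() / dens.size
# Riemann sum: every grid point represents a cell of area step * step.
riemann = dens.sum() * step * step

print("mean of densities:", mean_only)
print("Riemann sum      :", riemann)
```

The Riemann sum still stays below 1 here, because the KDE puts some of its mass outside the bounding box of the data.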

Answer 2

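The following code compares two ways of calculating the volume of the intersection: via scipy's dblquad, or by taking the mean value over a grid.

Remarks:

For the 2D case (and with only 100 sample points), delta seems to need to be quite a bit larger than 10%. The code below uses 25%. With a delta of 10%, the calculated values for f1 and f2 are about 0.90, while in theory they should be 1.0. With a delta of 25%, these values are around 0.994.
To approximate the volume the simple way, the mean value needs to be multiplied by the area (here (xmax - xmin)*(ymax - ymin)). Also, the more grid points are considered, the better the approximation. The code below uses 1000x1000 grid points.
Scipy has some special functions to calculate integrals, such as scipy.integrate.dblquad. This is much slower than the 'simple' method, but a bit more precise. The default precision did not work, so the code below lowers that precision considerably. (dblquad returns two numbers: the approximated integral and an indication of the error. To get only the integral, the code uses dblquad()[0].)
The same approach can be used for more dimensions. For the 'simple' method, create a grid with more dimensions (xx, yy, zz = np.mgrid[xmin:xmax:100j, ymin:ymax:100j, zmin:zmax:100j]). Note that a subdivision of 1000 in each dimension would create a grid that is too large to work with.
When using scipy.integrate, dblquad needs to be replaced by tplquad for 3 dimensions or nquad for N dimensions. This will probably also be rather slow, so the accuracy may need to be reduced further.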
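For higher dimensions, where both a dense grid and nquad become expensive, a Monte Carlo estimate is one possible alternative. This is not part of the code in this answer; it is a sketch that relies on the identity "integral of min(f, g) = E[min(1, g(X)/f(X))] for X drawn from f". The function name overlap_monte_carlo and the 3D sample data are made up for the illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde


def overlap_monte_carlo(kde_f, kde_g, n_samples=20000, seed=0):
    """Estimate the integral of min(f, g) by sampling points from f."""
    # Draw points distributed according to f; the result has shape (d, n_samples).
    samples = kde_f.resample(n_samples, seed=seed)
    f_vals = kde_f(samples)
    g_vals = kde_g(samples)
    # min(f, g) / f = min(1, g / f), averaged over samples drawn from f.
    return float(np.mean(np.minimum(1.0, g_vals / f_vals)))


# Hypothetical 3D example data: 100 points per set, one row per dimension.
rng = np.random.default_rng(1)
kde_a = gaussian_kde(rng.uniform(0, 1, size=(3, 100)))
kde_b = gaussian_kde(rng.uniform(0, 1, size=(3, 100)))

overlap = overlap_monte_carlo(kde_a, kde_b)
print("estimated overlap:", overlap)
```

No integration bounds are needed here because the samples come from the KDE itself; the price is sampling noise, which shrinks as n_samples grows.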

import numpy as np
import pandas as pd
import scipy.stats as st
from scipy.integrate import dblquad

df1 = pd.DataFrame({'x':np.random.uniform(0, 1, 100), 'y':np.random.uniform(0, 1, 100)})
df2 = pd.DataFrame({'x':np.random.uniform(0, 1, 100), 'y':np.random.uniform(0, 1, 100)})

# Extract x and y
x1 = df1['x']
y1 = df1['y']
x2 = df2['x']
y2 = df2['y']
# Define the borders
deltaX = (np.max([x1, x2]) - np.min([x1, x2])) / 4
deltaY = (np.max([y1, y2]) - np.min([y1, y2])) / 4
xmin = np.min([x1, x2]) - deltaX
xmax = np.max([x1, x2]) + deltaX
ymin = np.min([y1, y2]) - deltaY
ymax = np.max([y1, y2]) + deltaY

# fit a gaussian kernel using scipy’s gaussian_kde method
kernel1 = st.gaussian_kde(np.vstack([x1, y1]))
kernel2 = st.gaussian_kde(np.vstack([x2, y2]))

print('volumes via scipy`s dblquad (volume):')
# gaussian_kde returns a length-1 array for a single point; [0] turns it into the scalar dblquad expects
print('  volume_f1 =', dblquad(lambda y, x: kernel1((x, y))[0], xmin, xmax, ymin, ymax, epsabs=1e-4, epsrel=1e-4)[0])
print('  volume_f2 =', dblquad(lambda y, x: kernel2((x, y))[0], xmin, xmax, ymin, ymax, epsabs=1e-4, epsrel=1e-4)[0])
print('  volume_intersection =',
    dblquad(lambda y, x: np.minimum(kernel1((x, y))[0], kernel2((x, y))[0]), xmin, xmax, ymin, ymax, epsabs=1e-4, epsrel=1e-4)[0])

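Alternatively, one can calculate the mean value over a grid of points and multiply the result by the area of the grid. Note that np.mgrid is much faster than creating a list via itertools.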

# Create meshgrid
xx, yy = np.mgrid[xmin:xmax:1000j, ymin:ymax:1000j]
positions = np.vstack([xx.ravel(), yy.ravel()])
f1 = np.reshape(kernel1(positions).T, xx.shape)
f2 = np.reshape(kernel2(positions).T, xx.shape)
intersection = np.minimum(f1, f2)
print('volumes via the mean value multiplied by the area:')
print('  volume_f1 =', np.sum(f1) / f1.size * ((xmax - xmin)*(ymax - ymin)))
print('  volume_f2 =', np.sum(f2) / f2.size * ((xmax - xmin)*(ymax - ymin)))
print('  volume_intersection =', np.sum(intersection) / intersection.size * ((xmax - xmin)*(ymax - ymin)))

Example output:

volumes via scipy`s dblquad (volume):
  volume_f1 = 0.9946974276169385
  volume_f2 = 0.9928998852123891
  volume_intersection = 0.9046421634401607
volumes via the mean value multiplied by the area:
  volume_f1 = 0.9927873844924111
  volume_f2 = 0.9910132867915901
  volume_intersection = 0.9028999384136771
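The volumes of f1 and f2 stay slightly below 1 because part of each KDE's mass lies outside the integration rectangle. One way to check how much mass a given padding captures is gaussian_kde.integrate_box. The snippet below is a sketch on its own random data, so its numbers will not match the output above exactly; the names lo, hi and span are mine.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical sample: 100 uniform points in the unit square.
rng = np.random.default_rng(0)
data = rng.uniform(0, 1, size=(2, 100))
kde = gaussian_kde(data)

# Bounding box of the data and its extent per dimension.
lo = data.min(axis=1)
hi = data.max(axis=1)
span = hi - lo

# Mass of the KDE inside the box padded by 10% and by 25% of the extent.
mass_10 = kde.integrate_box(lo - 0.10 * span, hi + 0.10 * span)
mass_25 = kde.integrate_box(lo - 0.25 * span, hi + 0.25 * span)

print("mass inside box, 10% padding:", mass_10)
print("mass inside box, 25% padding:", mass_25)
```

If the captured mass is too far from 1 for your purposes, increase the padding before computing the intersection.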
