How does `sklearn.neighbors.KernelDensity` deal with overflow on high-dimensional data, and how can I mimic this boundary feature in TensorFlow?
I am trying to mimic the "gaussian" kernel density computation of sklearn.neighbors.KernelDensity in TensorFlow. No doubt the computation will be faster if every operation can be converted into graph mode with tf.function. Here is my code, which mostly works:
import tensorflow as tf
import numpy as np
from sklearn.neighbors import KernelDensity
tf2pi = tf.constant(2*np.pi,dtype=tf.float64)
def log_gauss_norm(h,d):
    return -0.5*d*tf.math.log(tf2pi)-d*tf.math.log(h)

def gauss(x,d,h):
    y = log_gauss_norm(h,d)-0.5*tf.reduce_sum(x**2,axis=-1)
    return tf.math.exp(y)

@tf.function
def my_kde(x,data_array,bandwidth=2.):
    n_features = tf.cast(float(data_array.shape[-1]),tf.float64)
    bandwidth = tf.cast(bandwidth,tf.float64)
    assert len(x.shape)==2
    x = x[:,tf.newaxis,:]
    y = gauss((x-data_array)/bandwidth,d=n_features,h=bandwidth)
    y = tf.reduce_mean(y,axis=-1)
    return tf.math.log(y)
# succeeds
np.random.seed(0)
basic = np.array(np.random.normal(0,1.0,size=[10000,40]),dtype=np.float64)
kde = KernelDensity(kernel='gaussian',bandwidth=2).fit(basic)
y1 = my_kde(basic[0:2],basic)
tf.print(y1) # [-73.09079498452077 -71.975842500329691]
y2 = kde.score_samples(basic[0:2])
print(y2) # [-73.09079498 -71.9758425 ]
assert all(np.isclose(y1-y2,0.0))
# overflow
np.random.seed(0)
basic = np.array(np.random.normal(0,1.0,size=[10000,800]),dtype=np.float64)
kde = KernelDensity(kernel='gaussian',bandwidth=2).fit(basic)
y1 = my_kde(basic[0:2],basic)
tf.print(y1) # [-inf -inf]
y2 = kde.score_samples(basic[0:2])
print(y2) # [-1298.87891138 -1298.87891138]
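If I understand it correctly, the -inf is plain float64 underflow rather than any special boundary rule: in the 40-dimensional case the log values are near -73 and exp of that scale is still representable, but in 800 dimensions the per-sample log-kernel values sit near -1300, and float64 exp() returns exactly 0.0 for arguments below roughly -745. The mean of exact zeros is 0, and log(0) is -inf. A quick check of that threshold (my own illustration, not sklearn code):

import numpy as np
print(np.exp(-745.0))  # about 5e-324, the smallest positive (subnormal) float64
print(np.exp(-746.0))  # 0.0 -- below this every kernel value underflows, so log(mean) = -inf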
The problem is that with high-dimensional data, such as the 800 dimensions in the code above, my imitation gives -inf, while sklearn.neighbors.KernelDensity still works, simply bottoming out at a meaningful lower bound. I want to mimic this lower-bound behavior. I tried to dig into the source code and found that the key code lives in the _kde_single_breadthfirst() function in sklearn\neighbors\_binary_tree.pxi, but I cannot understand that function. So I am asking for help here.
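In the meantime, here is the workaround I am experimenting with: a minimal sketch that stays in log space the whole time and collapses the sum with tf.reduce_logsumexp, using the identity log(mean(exp(v))) = logsumexp(v) - log(n). The name my_kde_logspace is mine, and I am assuming (not certain) that this matches what sklearn effectively computes:

import tensorflow as tf
import numpy as np

tf2pi = tf.constant(2*np.pi,dtype=tf.float64)

def log_gauss(x,d,h):
    # log of the normalized Gaussian kernel; never exponentiated, so it cannot underflow
    return -0.5*d*tf.math.log(tf2pi)-d*tf.math.log(h)-0.5*tf.reduce_sum(x**2,axis=-1)

@tf.function
def my_kde_logspace(x,data_array,bandwidth=2.):
    n_samples = tf.cast(tf.shape(data_array)[0],tf.float64)
    n_features = tf.cast(data_array.shape[-1],tf.float64)
    h = tf.cast(bandwidth,tf.float64)
    assert len(x.shape)==2
    x = x[:,tf.newaxis,:]                                # [m, 1, d] against [n, d]
    log_k = log_gauss((x-data_array)/h,n_features,h)     # [m, n]
    # log(mean(exp(log_k))) computed stably in log space
    return tf.reduce_logsumexp(log_k,axis=-1)-tf.math.log(n_samples)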
Sorry, because I lack a computer-science background, at first I did not understand why the data should be stored in a tree structure when estimating density. But now I can narrow this question down to: how do I mimic a KD-tree or ball tree in TensorFlow, and then search it, compute the densities, and handle the boundary?
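For what it is worth, a quick check of the log-space sketch above against the failing 800-dimensional case (my expectation, still to be verified, is that both sides now agree, since as far as I can tell sklearn's tree mainly prunes distance computations for speed, while the finite result comes from accumulating in log space):

from sklearn.neighbors import KernelDensity
np.random.seed(0)
basic = np.array(np.random.normal(0,1.0,size=[10000,800]),dtype=np.float64)
kde = KernelDensity(kernel='gaussian',bandwidth=2).fit(basic)
y1 = my_kde_logspace(basic[0:2],basic)
y2 = kde.score_samples(basic[0:2])
# both should now be finite, around -1298.879 per the output above
assert all(np.isclose(np.asarray(y1)-y2,0.0))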