statsmodels: IntegrationWarning: The maximum number of subdivisions (50) has been achieved

Question

I was trying to plot a CDF with seaborn and ran into this warning:

../venv/lib/python3.7/site-packages/statsmodels/nonparametric/kde.py:178: IntegrationWarning: The maximum number of subdivisions (50) has been achieved.
  If increasing the limit yields no improvement it is advised to analyze 
  the integrand in order to determine the difficulties.  If the position of a 
  local difficulty can be determined (singularity, discontinuity) one will 
  probably gain from splitting up the interval and calling the integrator 
  on the subranges.  Perhaps a special-purpose integrator should be used.
  args=endog)[0] for i in range(1, gridsize)]

A few minutes after pressing the return key:

../venv/lib/python3.7/site-packages/statsmodels/nonparametric/kde.py:178: IntegrationWarning: The integral is probably divergent, or slowly convergent.
  args=endog)[0] for i in range(1, gridsize)]

Code:

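import matplotlib.pyplot as plt
import seaborn as sns

# `data` is the 1-D array of about 4.36 million floats described below
plt.figure()
plt.title('my distribution')
plt.ylabel('CDF')
plt.xlabel('x-labelled')
sns.kdeplot(data, cumulative=True)
plt.show()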

In case it helps:

print(len(data))

4360700

Sample data:

print(data[:10])

[ 0.00362846  0.00123409  0.00013711 -0.00029235  0.01515175  0.02780404
  0.03610236  0.03410224  0.03887933  0.0307084 ]

I don't know what the subdivisions are. Is there a way to increase them?

Answer 1

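A kde plot is created by summing one Gaussian bell curve per data point. Summing 4 million curves causes memory and performance problems, which can make the function fail. The exact error message can be quite cryptic.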
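To make the "subdivisions" in the warning concrete: the traceback line (`args=endog)[0] for i in range(1, gridsize)]`) suggests that the cumulative curve is obtained by numerically integrating the density, and the warning text matches the one issued by `scipy.integrate.quad`, whose `limit` parameter (default 50) caps the number of adaptive subintervals. The following is only a minimal sketch of that idea, not the statsmodels implementation; the sample, the bandwidth `bw`, and the helper `kde_density` are assumptions made for illustration.

```python
import numpy as np
from scipy import integrate, stats

rng = np.random.default_rng(0)
sample = rng.normal(size=200)   # small synthetic sample (assumption)
bw = 0.3                        # hand-picked bandwidth (assumption)

def kde_density(x, points, bandwidth):
    """KDE at a scalar x: the average of one Gaussian bump per data point."""
    return stats.norm.pdf(x, loc=points, scale=bandwidth).mean()

# Cumulative KDE at x = 0.5 via numerical integration.
# `limit` is the maximum number of adaptive subintervals quad may use.
lower = sample.min() - 10 * bw
result = integrate.quad(kde_density, lower, 0.5,
                        args=(sample, bw), limit=200, full_output=1)
value, abserr, info = result[:3]

# Closed form for comparison: the mean of the per-point Gaussian CDFs.
exact = stats.norm.cdf(0.5, loc=sample, scale=bw).mean()
print(value, exact, info["last"])  # info["last"]: subintervals actually used
```

Every evaluation of the integrand touches all data points, so the cost grows with both the sample size and the number of subintervals. That is consistent with the advice in this answer to reduce the data rather than raise the limit.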

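The simplest way to solve the problem is to subsample the data. For a more or less smooth distribution, the kde (and the cumulative kde, or cdf) looks very similar whether or not the data is subsampled. Taking every 100th entry is easy with slicing: data[::100].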
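As a rough check of the claim that subsampling barely changes the distribution, the sketch below compares a few quantiles of a full array with those of two subsamples. The synthetic `data` is a stand-in for the real array, and the random-choice variant is an addition not mentioned above; it may be preferable when the order of the entries is not random (for example, sorted or periodic data).

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic stand-in for the real data (assumption).
data = rng.normal(loc=0.02, scale=0.01, size=1_000_000)

sub_slice = data[::100]                                # every 100th entry
sub_random = rng.choice(data, size=data.size // 100,   # random 1% subsample
                        replace=False)

probs = [0.05, 0.25, 0.5, 0.75, 0.95]
q_full = np.quantile(data, probs)
q_slice = np.quantile(sub_slice, probs)
q_random = np.quantile(sub_random, probs)

# Largest quantile deviation between the full data and either subsample.
max_dev = max(np.abs(q_full - q_slice).max(),
              np.abs(q_full - q_random).max())
print(q_full)
print(max_dev)
```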

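Alternatively, with this much data, you can draw the "real" cdf by plotting the sorted data against N evenly spaced numbers from 0 to 1 (where N is the number of data points).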

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

N = 1000000
data = np.random.normal(np.repeat(np.random.uniform(10, 20, 10), N // 10), 1)
sns.kdeplot(data[::100], cumulative=True, color='g', label='cumulative kde')
q = np.linspace(0, 1, data.size)
data.sort()
plt.plot(data, q, ':r', lw=2, label='cdf from sorted data')
plt.legend()
plt.show()
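The sorted-data idea can also be wrapped in small helpers, which makes it possible to read off the cdf at a given value without plotting. This is a sketch under assumptions: the names `ecdf` and `ecdf_at` are hypothetical, and it uses the common i/N convention for the heights, which differs slightly from the `np.linspace(0, 1, N)` spacing used in the plot above.

```python
import numpy as np

def ecdf(values):
    """Return the sorted values and the ECDF heights i/N for i = 1..N."""
    x = np.sort(np.asarray(values))
    y = np.arange(1, x.size + 1) / x.size
    return x, y

def ecdf_at(values_sorted, query):
    """Fraction of samples less than or equal to `query`."""
    return np.searchsorted(values_sorted, query, side="right") / values_sorted.size

x, y = ecdf([3.0, 1.0, 2.0])
frac = ecdf_at(x, 2.0)
print(x, y, frac)
```

Sorting costs O(N log N) once; each later lookup with `np.searchsorted` is a binary search, so it stays cheap even for millions of points.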

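Note that, given a large enough array, you can draw a "more honest" kde in a similar but slightly more complicated way, by taking the differences of the sorted data. np.interp interpolates the quantiles onto a regularly spaced x-axis. Because the raw differences are quite jagged, some smoothing is needed.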

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import statsmodels.api as sm

N = 1000000
data = np.random.normal(np.repeat(np.random.uniform(10, 20, 10), N // 10), 1)
sns.kdeplot(data[::100], cumulative=False, color='g', label='kde')
p = np.linspace(0, 1, data.size)
data.sort()

x = np.linspace(data.min(), data.max(), 1000)
y = np.interp(x, data, p)

# use lowess filter to smoothen the curve
lowess = sm.nonparametric.lowess(np.diff(y) * 1000 / (data.max() - data.min()), (x[:-1] + x[1:]) / 2, frac=0.05)
plt.plot(lowess[:, 0], lowess[:, 1], '-r', label='smoothed diff of sorted data')

# plt.plot((x[:-1]+x[1:])/2,
#         np.convolve(np.diff(y), np.ones(20)/20, mode='same')*1000/(data.max() - data.min()),
#         label='test np.diff')

plt.legend()
plt.show()

