使用 pymc 拟合二项分布会引发某些 FillValues 的 ZeroProbability 错误
Fitting a Binomial distribution with pymc raises ZeroProbability error for certain FillValues
我不确定我是否在 pymc 中发现了错误。似乎用缺失数据拟合二项式会产生 ZeroProbability
错误,具体取决于所选择的 fill_value 来掩盖缺失数据。但也许我用错了。
我使用 github 中的当前主分支尝试了以下示例。我知道 ,但这似乎是一个不同的问题。
我用 pymc 安装了一个二项式分布,一切都如我所料:
import scipy as sp
import pymc
def make_model(observed_values):
p = pymc.Uniform('p', lower = 0.0, upper = 1.0, value = 0.1)
values = pymc.Binomial('values', n = 10* sp.ones_like(observed_values), p = p * sp.ones_like(observed_values),\
value = observed_values, observed = True, plot = False)
values = pymc.Binomial('values', n = 10, p = p,\
value = observed_values, observed = True, plot = False)
return locals()
sp.random.seed(0)
observed_values = sp.random.binomial(n = 10.0, p = 0.1, size = 100)
M1 = pymc.MCMC(make_model(observed_values))
M1.sample(iter=10000, burn=1000, thin=10)
pymc.Matplot.plot(M1)
M1.summary()
输出:
[-----------------100%-----------------] 10000 of 10000 complete in 0.7 sec
Plotting p
p:
Mean SD MC Error 95% HPD interval
------------------------------------------------------------------
0.093 0.007 0.0 [ 0.081 0.107]
Posterior quantiles:
2.5 25 50 75 97.5
|---------------|===============|===============|---------------|
0.08 0.088 0.093 0.097 0.106
现在,我尝试了一种非常相似的情况,不同之处在于将缺少一个观测值:
mask = sp.zeros_like(observed_values)
mask[0] = True
masked_values = sp.ma.masked_array(observed_values, mask = mask, fill_value = 999999)
M2 = pymc.MCMC(make_model(masked_values))
M2.sample(iter=10000, burn=1000, thin=10)
pymc.Matplot.plot(M2)
M2.summary()
没想到,我得到了一个ZeroProbability
错误:
---------------------------------------------------------------------------
ZeroProbability Traceback (most recent call last)
<ipython-input-16-4f945f269628> in <module>()
----> 1 M2 = pymc.MCMC(make_model(masked_values))
2 M2.sample(iter=10000, burn=1000, thin=10)
3 pymc.Matplot.plot(M2)
4 M2.summary()
<ipython-input-12-cb8707bb911f> in make_model(observed_values)
4 def make_model(observed_values):
5 p = pymc.Uniform('p', lower = 0.0, upper = 1.0, value = 0.1)
----> 6 values = pymc.Binomial('values', n = 10* sp.ones_like(observed_values), p = p * sp.ones_like(observed_values), value = observed_values, observed = True, plot = False)
7 values = pymc.Binomial('values', n = 10, p = p, value = observed_values, observed = True, plot = False)
8 return locals()
/home/fabian/anaconda/lib/python2.7/site-packages/pymc/distributions.pyc in __init__(self, *args, **kwds)
318 logp_partial_gradients=logp_partial_gradients,
319 dtype=dtype,
--> 320 **arg_dict_out)
321
322 new_class.__name__ = name
/home/fabian/anaconda/lib/python2.7/site-packages/pymc/PyMCObjects.pyc in __init__(self, logp, doc, name, parents, random, trace, value, dtype, rseed, observed, cache_depth, plot, verbose, isdata, check_logp, logp_partial_gradients)
773 if check_logp:
774 # Check initial value
--> 775 if not isinstance(self.logp, float):
776 raise ValueError(
777 "Stochastic " +
/home/fabian/anaconda/lib/python2.7/site-packages/pymc/PyMCObjects.pyc in get_logp(self)
930 (self._value, self._parents.value))
931 else:
--> 932 raise ZeroProbability(self.errmsg)
933
934 return logp
ZeroProbability: Stochastic values's value is outside its support,
or it forbids its parents' current values.
但是,如果我将掩码数组中的填充值更改为 1,拟合将再次起作用:
masked_values2 = sp.ma.masked_array(observed_values, mask = mask, fill_value = 1)
M3 = pymc.MCMC(make_model(masked_values2))
M3.sample(iter=10000, burn=1000, thin=10)
pymc.Matplot.plot(M3)
M3.summary()
输出:
[-----------------100%-----------------] 10000 of 10000 complete in 2.1 sec
Plotting p
p:
Mean SD MC Error 95% HPD interval
------------------------------------------------------------------
0.092 0.007 0.0 [ 0.079 0.105]
Posterior quantiles:
2.5 25 50 75 97.5
|---------------|===============|===============|---------------|
0.079 0.088 0.092 0.097 0.105
values:
Mean SD MC Error 95% HPD interval
------------------------------------------------------------------
1.15 0.886 0.029 [ 0. 3.]
Posterior quantiles:
2.5 25 50 75 97.5
|---------------|===============|===============|---------------|
0.0 1.0 1.0 2.0 3.0
这是错误还是我的模型有问题?
感谢您的帮助!
我在 GitHub 上问了 pymc 开发人员这个问题,我从 fonnesbeck (https://github.com/pymc-devs/pymc/issues/47#issuecomment-129301002) 那里得到了答案:
This is because you filled the missing values with 999999, which is outside the support of the variable. You need to give it a valid value, but not one that occurs in the non-missing data."
为确保缺失值不在非缺失数据中,建议执行以下操作:
You can give it a non-integer value in order to avoid the problem that you cite. For example, if you fill with, say, 1.5 that should work.
(...) PyMC calculates the log-probability at the first iteration, and therefore the values inserted for the missing values at the first iteration have to be valid. If you give discrete values a floating point value, it should end up getting truncated when converted to integers, so that will work."
我不确定我是否在 pymc 中发现了错误。似乎用缺失数据拟合二项式会产生 ZeroProbability
错误,具体取决于所选择的 fill_value 来掩盖缺失数据。但也许我用错了。
我使用 github 中的当前主分支尝试了以下示例。我知道
我用 pymc 安装了一个二项式分布,一切都如我所料:
import scipy as sp
import pymc
def make_model(observed_values):
p = pymc.Uniform('p', lower = 0.0, upper = 1.0, value = 0.1)
values = pymc.Binomial('values', n = 10* sp.ones_like(observed_values), p = p * sp.ones_like(observed_values),\
value = observed_values, observed = True, plot = False)
values = pymc.Binomial('values', n = 10, p = p,\
value = observed_values, observed = True, plot = False)
return locals()
sp.random.seed(0)
observed_values = sp.random.binomial(n = 10.0, p = 0.1, size = 100)
M1 = pymc.MCMC(make_model(observed_values))
M1.sample(iter=10000, burn=1000, thin=10)
pymc.Matplot.plot(M1)
M1.summary()
输出:
[-----------------100%-----------------] 10000 of 10000 complete in 0.7 sec
Plotting p
p:
Mean SD MC Error 95% HPD interval
------------------------------------------------------------------
0.093 0.007 0.0 [ 0.081 0.107]
Posterior quantiles:
2.5 25 50 75 97.5
|---------------|===============|===============|---------------|
0.08 0.088 0.093 0.097 0.106
现在,我尝试了一种非常相似的情况,不同之处在于将缺少一个观测值:
mask = sp.zeros_like(observed_values)
mask[0] = True
masked_values = sp.ma.masked_array(observed_values, mask = mask, fill_value = 999999)
M2 = pymc.MCMC(make_model(masked_values))
M2.sample(iter=10000, burn=1000, thin=10)
pymc.Matplot.plot(M2)
M2.summary()
没想到,我得到了一个ZeroProbability
错误:
---------------------------------------------------------------------------
ZeroProbability Traceback (most recent call last)
<ipython-input-16-4f945f269628> in <module>()
----> 1 M2 = pymc.MCMC(make_model(masked_values))
2 M2.sample(iter=10000, burn=1000, thin=10)
3 pymc.Matplot.plot(M2)
4 M2.summary()
<ipython-input-12-cb8707bb911f> in make_model(observed_values)
4 def make_model(observed_values):
5 p = pymc.Uniform('p', lower = 0.0, upper = 1.0, value = 0.1)
----> 6 values = pymc.Binomial('values', n = 10* sp.ones_like(observed_values), p = p * sp.ones_like(observed_values), value = observed_values, observed = True, plot = False)
7 values = pymc.Binomial('values', n = 10, p = p, value = observed_values, observed = True, plot = False)
8 return locals()
/home/fabian/anaconda/lib/python2.7/site-packages/pymc/distributions.pyc in __init__(self, *args, **kwds)
318 logp_partial_gradients=logp_partial_gradients,
319 dtype=dtype,
--> 320 **arg_dict_out)
321
322 new_class.__name__ = name
/home/fabian/anaconda/lib/python2.7/site-packages/pymc/PyMCObjects.pyc in __init__(self, logp, doc, name, parents, random, trace, value, dtype, rseed, observed, cache_depth, plot, verbose, isdata, check_logp, logp_partial_gradients)
773 if check_logp:
774 # Check initial value
--> 775 if not isinstance(self.logp, float):
776 raise ValueError(
777 "Stochastic " +
/home/fabian/anaconda/lib/python2.7/site-packages/pymc/PyMCObjects.pyc in get_logp(self)
930 (self._value, self._parents.value))
931 else:
--> 932 raise ZeroProbability(self.errmsg)
933
934 return logp
ZeroProbability: Stochastic values's value is outside its support,
or it forbids its parents' current values.
但是,如果我将掩码数组中的填充值更改为 1,拟合将再次起作用:
masked_values2 = sp.ma.masked_array(observed_values, mask = mask, fill_value = 1)
M3 = pymc.MCMC(make_model(masked_values2))
M3.sample(iter=10000, burn=1000, thin=10)
pymc.Matplot.plot(M3)
M3.summary()
输出:
[-----------------100%-----------------] 10000 of 10000 complete in 2.1 sec
Plotting p
p:
Mean SD MC Error 95% HPD interval
------------------------------------------------------------------
0.092 0.007 0.0 [ 0.079 0.105]
Posterior quantiles:
2.5 25 50 75 97.5
|---------------|===============|===============|---------------|
0.079 0.088 0.092 0.097 0.105
values:
Mean SD MC Error 95% HPD interval
------------------------------------------------------------------
1.15 0.886 0.029 [ 0. 3.]
Posterior quantiles:
2.5 25 50 75 97.5
|---------------|===============|===============|---------------|
0.0 1.0 1.0 2.0 3.0
这是错误还是我的模型有问题? 感谢您的帮助!
我在 GitHub 上问了 pymc 开发人员这个问题,我从 fonnesbeck (https://github.com/pymc-devs/pymc/issues/47#issuecomment-129301002) 那里得到了答案:
This is because you filled the missing values with 999999, which is outside the support of the variable. You need to give it a valid value, but not one that occurs in the non-missing data."
为确保缺失值不在非缺失数据中,建议执行以下操作:
You can give it a non-integer value in order to avoid the problem that you cite. For example, if you fill with, say, 1.5 that should work.
(...) PyMC calculates the log-probability at the first iteration, and therefore the values inserted for the missing values at the first iteration have to be valid. If you give discrete values a floating point value, it should end up getting truncated when converted to integers, so that will work."