How to find probability distribution and parameters for real data? (Python 3)
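I have a dataset from sklearn and I plotted the distribution of the load_diabetes.target data (i.e. the regression values that load_diabetes.data is used to predict).
I used this dataset because it has the fewest variables/attributes of the regression datasets in sklearn.datasets.
Using Python 3, how can I get the distribution type and parameters of the distribution this most closely resembles?
All I know is that the target values are all positive and skewed (positive skew/right skew). Is there a way in Python to provide a few distributions and then get the best fit for the target data/vector? Or to actually suggest a fit based on the data that is given? That would be really useful for people who have theoretical statistical knowledge but little experience applying it to "real data".
Bonus
Would it make sense to use this type of approach to figure out what your posterior distribution would be with "real data"? If not, why not?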
    from sklearn.datasets import load_diabetes
    import matplotlib.pyplot as plt
    import seaborn as sns; sns.set()
    import pandas as pd

    # Get data
    data = load_diabetes()
    X, y_ = data.data, data.target

    # Organize data
    SR_y = pd.Series(y_, name="y_ (Target Vector Distribution)")

    # Plot data. sns.distplot is deprecated in recent seaborn releases;
    # histplot with kde=True draws a comparable histogram plus density curve.
    fig, ax = plt.subplots()
    sns.histplot(SR_y, bins=25, color="g", kde=True, stat="density", ax=ax)
    plt.show()
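To the best of my knowledge, there is no automatic way of obtaining the distribution type and parameters of a sample (inferring the distribution of a sample is a statistical problem in itself).
In my opinion, the best you can do is:
(for each attribute)
Try to fit each attribute to a reasonably large list of possible distributions (for an example with Scipy, see Fitting empirical distribution to theoretical ones with Scipy (Python)?).
Evaluate all your fits and pick the best one. This can be done by performing a Kolmogorov-Smirnov test between your sample and each of the fitted distributions (Scipy has an implementation of this too), and picking the one that minimises D, the test statistic (a.k.a. the difference between the sample and the fit).
Bonus: It would make sense, since you will be building a model on each of the variables as you pick a fit for each one, although the goodness of your prediction would depend on the quality of your data and the distributions you are using for fitting. You are building a model, after all.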
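The two steps above can be sketched roughly as follows. This is only an illustration, not a reference implementation: the helper name rank_by_ks_statistic, the candidate list, and the synthetic gamma sample are all assumptions made for the example, and you would substitute your own data for sample.

```python
import numpy as np
import scipy.stats as st

def rank_by_ks_statistic(sample, dist_names):
    """Fit each named scipy.stats distribution and rank the fits by the KS statistic D."""
    ranking = []
    for name in dist_names:
        dist = getattr(st, name)
        params = dist.fit(sample)  # maximum-likelihood fit
        d_stat, _ = st.kstest(sample, name, args=params)
        ranking.append((name, d_stat, params))
    # Smallest D first: the fitted CDF closest to the empirical CDF
    ranking.sort(key=lambda item: item[1])
    return ranking

# Synthetic right-skewed data (an assumption for illustration only)
rng = np.random.default_rng(0)
sample = rng.gamma(shape=2.0, scale=50.0, size=500)

ranking = rank_by_ks_statistic(sample, ["norm", "gamma", "lognorm", "expon"])
for name, d_stat, params in ranking:
    print(f"{name:8s} D = {d_stat:.4f}")
```

Because the parameters are estimated from the same sample that is being tested, D is best read here as a ranking score rather than as the basis for a formal hypothesis test.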
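You can use this code to fit (by maximum likelihood) different distributions to your data:
    import scipy.stats
    from sklearn.datasets import load_diabetes

    # y is the 1-D sample to fit; here the question's target vector (substitute your own data)
    y = load_diabetes().target

    dist_names = ['gamma', 'beta', 'rayleigh', 'norm', 'pareto']
    for dist_name in dist_names:
        dist = getattr(scipy.stats, dist_name)
        param = dist.fit(y)
        # param holds the fitted parameters: any shape parameters first, then location and scale
        print(dist_name, param)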
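You can see a sample snippet showing how to use the obtained parameters here: Fitting empirical distribution to theoretical ones with Scipy (Python)?
Then you can pick the distribution with the best log-likelihood (there are also other criteria for choosing the "best" distribution, such as Bayesian posterior probability, or AIC, BIC or BICc values, ...).
For your bonus question, I think there is no generic answer. If your dataset is large enough and was obtained under the same conditions as the real-world data, you can do it.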
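As a rough sketch of that selection step, the log-likelihood of each fit can be computed with the distribution's logpdf, and AIC and BIC follow from it as 2k - 2 ln L and k ln n - 2 ln L, where k is the number of fitted parameters and n the sample size. The function name information_criteria, the candidate list, and the synthetic sample are assumptions for illustration.

```python
import numpy as np
import scipy.stats as st

def information_criteria(sample, dist_names):
    """Return (name, k, log_likelihood, AIC, BIC) for each fitted distribution, best AIC first."""
    n = len(sample)
    rows = []
    for name in dist_names:
        dist = getattr(st, name)
        params = dist.fit(sample)  # shape parameters (if any), then loc and scale
        log_lik = float(np.sum(dist.logpdf(sample, *params)))
        k = len(params)  # number of estimated parameters
        aic = 2 * k - 2 * log_lik
        bic = k * np.log(n) - 2 * log_lik
        rows.append((name, k, log_lik, aic, bic))
    rows.sort(key=lambda row: row[3])  # lower AIC is better
    return rows

# Synthetic right-skewed data (an assumption for illustration only)
rng = np.random.default_rng(0)
sample = rng.gamma(shape=2.0, scale=50.0, size=500)

table = information_criteria(sample, ["norm", "gamma", "lognorm"])
for name, k, log_lik, aic, bic in table:
    print(f"{name:8s} k={k}  logL={log_lik:10.2f}  AIC={aic:10.2f}  BIC={bic:10.2f}")
```

Unlike the raw log-likelihood, AIC and BIC penalise extra parameters, so a three-parameter fit does not win merely because it is more flexible.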
Use this approach:
    import scipy.stats as st

    def get_best_distribution(data):
        """Fit several candidate distributions and return the one with the highest KS p-value."""
        dist_names = ["norm", "exponweib", "weibull_max", "weibull_min", "pareto", "genextreme"]
        dist_results = []
        params = {}
        for dist_name in dist_names:
            dist = getattr(st, dist_name)
            param = dist.fit(data)
            params[dist_name] = param

            # Apply the Kolmogorov-Smirnov test against the fitted distribution
            D, p = st.kstest(data, dist_name, args=param)
            print("p value for " + dist_name + " = " + str(p))
            dist_results.append((dist_name, p))

        # Select the best fitted distribution (largest p-value)
        best_dist, best_p = max(dist_results, key=lambda item: item[1])

        # Report the name of the best fit, its p-value, and its parameters
        print("Best fitting distribution: " + str(best_dist))
        print("Best p value: " + str(best_p))
        print("Parameters for the best fit: " + str(params[best_dist]))
        return best_dist, best_p, params[best_dist]
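For a similar question (see here), you may be interested in the explanation in @Michel_Baudin's answer. His code assesses around 40 different distributions available in the OpenTURNS library and chooses the best one according to the BIC criterion. It looks something like this:
    import openturns as ot

    # your_data_list is a placeholder for your own list of observations
    sample = ot.Sample([[x] for x in your_data_list])
    tested_factories = ot.DistributionFactory.GetContinuousUniVariateFactories()
    best_model, best_bic = ot.FittingTest.BestModelBIC(sample, tested_factories)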
This code also works:
    import pandas as pd
    import numpy as np
    import scipy.stats

    # List the names of the statistical distributions that you would like to check.
    # Full list is here: https://docs.scipy.org/doc/scipy/reference/stats.html
    dist_names = ['weibull_min', 'norm', 'weibull_max', 'beta',
                  'invgauss', 'uniform', 'gamma', 'expon',
                  'lognorm', 'pearson3', 'triang']

    # Read your data and set y_std to the column that you want to fit.
    y_std = pd.read_csv('my_df.csv')
    y_std = y_std['column_A']

    #-------------------------------------------------
    chi_square_statistics = []
    size = len(y_std)

    # 20 equally spaced percentile cut-offs of the observed data (these define 19 bins)
    percentile_bins = np.linspace(0, 100, 20)
    percentile_cutoffs = np.percentile(y_std, percentile_bins)
    observed_frequency, bins = np.histogram(y_std, bins=percentile_cutoffs)
    cum_observed_frequency = np.cumsum(observed_frequency)

    # Loop through candidate distributions
    for distribution in dist_names:
        # Set up distribution and get fitted distribution parameters
        dist = getattr(scipy.stats, distribution)
        param = dist.fit(y_std)
        print("{}\n{}\n".format(dist, param))

        # Get expected counts in percentile bins:
        # differences of the fitted CDF across the bin edges, scaled by the sample size
        cdf_fitted = dist.cdf(percentile_cutoffs, *param)
        expected_frequency = np.diff(cdf_fitted) * size
        cum_expected_frequency = np.cumsum(expected_frequency)

        # Chi-square-style statistic computed on cumulative counts
        # (the textbook Pearson statistic uses per-bin counts and divides by the expected count)
        ss = sum(((cum_expected_frequency - cum_observed_frequency) ** 2) / cum_observed_frequency)
        chi_square_statistics.append(ss)

    # Sort by minimum chi-square statistic
    results = pd.DataFrame()
    results['Distribution'] = dist_names
    results['chi_square'] = chi_square_statistics
    results.sort_values(['chi_square'], inplace=True)

    print('\nDistributions listed by goodness of fit:')
    print('............................................')
    print(results)
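The statistic in the code above is built from cumulative counts and divides by the observed values. For comparison, here is a minimal sketch of the textbook Pearson form, which sums (observed - expected)^2 / expected over individual bins. The helper name pearson_chi_square, the choice of 10 bins, and the synthetic sample are assumptions for illustration, and this is a swapped-in variant, not the code the answer uses.

```python
import numpy as np
import scipy.stats as st

def pearson_chi_square(sample, dist_name, n_bins=10):
    """Pearson chi-square statistic of a fitted scipy.stats distribution over percentile bins."""
    dist = getattr(st, dist_name)
    params = dist.fit(sample)

    # Bin edges at equally spaced percentiles of the observed data
    edges = np.percentile(sample, np.linspace(0, 100, n_bins + 1))
    observed, _ = np.histogram(sample, bins=edges)

    # Extend the outer edges to +/- infinity so the expected counts sum to the sample size
    cdf_edges = edges.copy()
    cdf_edges[0], cdf_edges[-1] = -np.inf, np.inf
    expected = len(sample) * np.diff(dist.cdf(cdf_edges, *params))

    return float(np.sum((observed - expected) ** 2 / expected))

# Synthetic right-skewed data (an assumption for illustration only)
rng = np.random.default_rng(0)
sample = rng.gamma(shape=2.0, scale=50.0, size=500)

stat_norm = pearson_chi_square(sample, "norm")
stat_gamma = pearson_chi_square(sample, "gamma")
print("norm :", round(stat_norm, 2))
print("gamma:", round(stat_gamma, 2))
```

A smaller value indicates a closer match between observed and expected bin counts, so the candidates can be ranked the same way as in the answer's results table.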