How to find probability distribution and parameters for real data? (Python 3)

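I have a dataset from sklearn and I plotted the distribution of the load_diabetes.target data (i.e. the regression values that load_diabetes.data is used to predict).

I used this dataset because it has the fewest variables/attributes of the regression datasets in sklearn.datasets.

Using Python 3, how can I get the distribution type and the parameters of the distribution that this most closely resembles?

All I know is that the target values are all positive and skewed (positive skew/right skew). Is there a way in Python to provide a few distributions and then get the best fit for the target data/vector? Or to actually suggest a fit based on the data that is given? That would be really useful for people who have theoretical statistical knowledge but little experience applying it to "real data".

Bonus: Would it make sense to use this type of approach to figure out what your posterior distribution would be with "real data"? If not, why not?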

from sklearn.datasets import load_diabetes
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import pandas as pd

#Get Data
data = load_diabetes()
X, y_ = data.data, data.target

#Organize Data
SR_y = pd.Series(y_, name="y_ (Target Vector Distribution)")

#Plot Data
fig, ax = plt.subplots()
sns.histplot(SR_y, bins=25, color="g", kde=True, stat="density", ax=ax)  # distplot is deprecated in recent seaborn releases
plt.show()

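To the best of my knowledge, there is no automatic way to obtain the distribution type and parameters of a sample (inferring the distribution of a sample is a statistical problem in itself).

In my opinion, the best you can do is:

(for each attribute)

  • Try to fit each attribute to a reasonably large list of candidate distributions (e.g. see Fitting empirical distribution to theoretical ones with Scipy (Python)? for an example with Scipy).

  • Evaluate all your fits and pick the best one. This can be done by performing a Kolmogorov-Smirnov test between your sample and each fitted distribution (Scipy has an implementation of this too), and picking the one that minimises D, the test statistic (a.k.a. the largest difference between the sample's empirical CDF and the fitted CDF).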
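The two steps above can be sketched roughly as follows. This is only an illustration, not code from the linked question: the helper name `rank_by_ks`, the candidate list, and the synthetic gamma sample are assumptions. Because the parameters are estimated from the same sample, the KS p-values come out optimistic, so D is used here only to rank the candidates.

```python
import numpy as np
import scipy.stats as st


def rank_by_ks(sample, dist_names):
    """Fit each candidate by maximum likelihood and rank by the KS statistic D.

    Hypothetical helper for illustration; returns a list of
    (name, D, p_value, params) tuples sorted by increasing D.
    """
    results = []
    for name in dist_names:
        dist = getattr(st, name)
        params = dist.fit(sample)  # shape parameters (if any), then loc, scale
        d_stat, p_value = st.kstest(sample, name, args=params)
        results.append((name, d_stat, p_value, params))
    return sorted(results, key=lambda r: r[1])


# Synthetic right-skewed data standing in for one attribute (assumption).
rng = np.random.default_rng(0)
sample = rng.gamma(shape=2.0, scale=50.0, size=500)

ranking = rank_by_ks(sample, ["norm", "gamma", "lognorm", "expon"])
for name, d_stat, p_value, _ in ranking:
    print(f"{name:8s} D={d_stat:.4f} p={p_value:.4f}")
```

The first entry of `ranking` is the candidate with the smallest D; it is still worth plotting the fitted density over the histogram before trusting it.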

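Bonus: It would make sense, since you will be building a model on each variable as you pick a fit for each one, although the quality of your predictions will depend on the quality of your data and on the distributions you use for fitting. You are building a model, after all.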

You can use this code to fit (by maximum likelihood) different distributions to your data:

import matplotlib.pyplot as plt
import scipy
import scipy.stats

# y is the 1-D array of observations to fit (y_ in the question's code)
dist_names = ['gamma', 'beta', 'rayleigh', 'norm', 'pareto']

for dist_name in dist_names:
    dist = getattr(scipy.stats, dist_name)
    param = dist.fit(y)
    # param holds the fitted shape parameters (if any), followed by loc and scale
    print(dist_name, param)

You can see a sample snippet showing how to use the fitted parameters here: Fitting empirical distribution to theoretical ones with Scipy (Python)?
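As a rough sketch of what using the fitted parameters can look like (this is not the snippet from the linked question; the synthetic sample and the variable names are assumptions), the tuple returned by `fit` can be unpacked into a "frozen" distribution object, which then exposes `pdf`, `ppf`, `rvs` and so on:

```python
import numpy as np
import scipy.stats as st

# Synthetic stand-in for the target vector (assumption, not the diabetes data).
rng = np.random.default_rng(1)
y = rng.gamma(shape=2.0, scale=50.0, size=400)

# fit() returns the shape parameters (if any), followed by loc and scale.
params = st.gamma.fit(y)
*shape_params, loc, scale = params

# "Freeze" the distribution with the fitted parameters.
fitted = st.gamma(*shape_params, loc=loc, scale=scale)

x = np.linspace(y.min(), y.max(), 200)
density = fitted.pdf(x)          # values to overlay on a density histogram
median = fitted.ppf(0.5)         # fitted median
new_draws = fitted.rvs(size=5, random_state=rng)  # simulate new observations

print("params:", params)
print("fitted median:", median)
```

Plotting `density` against `x` on top of a histogram of `y` drawn with density scaling is a quick visual check of the fit.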

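Then you can pick the distribution with the best log-likelihood (there are also other criteria for choosing the "best" distribution, such as Bayesian posterior probability, AIC, BIC or BICc values, ...).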
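A minimal sketch of such a comparison, assuming the usual definitions AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L, where k counts every fitted parameter (including loc and scale). The function name `information_criteria`, the candidate list, and the synthetic data are hypothetical choices for illustration:

```python
import numpy as np
import scipy.stats as st


def information_criteria(sample, dist_names):
    """Fit each candidate and return a list of dicts sorted by BIC (lowest first)."""
    n = len(sample)
    rows = []
    for name in dist_names:
        dist = getattr(st, name)
        params = dist.fit(sample)
        log_lik = np.sum(dist.logpdf(sample, *params))  # maximised log-likelihood
        k = len(params)                                 # number of fitted parameters
        rows.append({
            "name": name,
            "k": k,
            "log_lik": log_lik,
            "aic": 2 * k - 2 * log_lik,
            "bic": k * np.log(n) - 2 * log_lik,
        })
    return sorted(rows, key=lambda r: r["bic"])


# Synthetic right-skewed sample (assumption).
rng = np.random.default_rng(2)
sample = rng.gamma(shape=2.0, scale=50.0, size=500)

table = information_criteria(sample, ["norm", "gamma", "lognorm", "expon"])
for row in table:
    print(f"{row['name']:8s} logL={row['log_lik']:.1f} "
          f"AIC={row['aic']:.1f} BIC={row['bic']:.1f}")
```

Raw log-likelihood always favours the more flexible family, which is why the penalised criteria are usually preferred when the candidates have different numbers of parameters.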

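For your bonus question, I think there is no generic answer. If your data set is large enough and was obtained under the same conditions as the real-world data, you can do it.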

Use this approach:

import scipy.stats as st
def get_best_distribution(data):
    dist_names = ["norm", "exponweib", "weibull_max", "weibull_min", "pareto", "genextreme"]
    dist_results = []
    params = {}
    for dist_name in dist_names:
        dist = getattr(st, dist_name)
        param = dist.fit(data)

        params[dist_name] = param
        # Applying the Kolmogorov-Smirnov test
        D, p = st.kstest(data, dist_name, args=param)
        print("p value for "+dist_name+" = "+str(p))
        dist_results.append((dist_name, p))

    # select the best fitted distribution
    best_dist, best_p = (max(dist_results, key=lambda item: item[1]))
    # store the name of the best fit and its p value

    print("Best fitting distribution: "+str(best_dist))
    print("Best p value: "+ str(best_p))
    print("Parameters for the best fit: "+ str(params[best_dist]))

    return best_dist, best_p, params[best_dist]

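In a similar question (see here), you may be interested in @Michel_Baudin's answer. His code assesses around 40 distributions available in the OpenTURNS library and selects the best one according to the BIC criterion. It looks something like this: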

import openturns as ot

sample = ot.Sample([[x] for x in your_data_list])
tested_factories = ot.DistributionFactory.GetContinuousUniVariateFactories()
best_model, best_bic = ot.FittingTest.BestModelBIC(sample, tested_factories)

This code also works:

import pandas as pd
import numpy as np
import scipy
from scipy import stats

#Please write below the name of the statistical distributions that you would like to check.
#Full list is here: https://docs.scipy.org/doc/scipy/reference/stats.html
dist_names = ['weibull_min','norm','weibull_max','beta',
              'invgauss','uniform','gamma','expon',   
              'lognorm','pearson3','triang']

#Read your data and set y_std to the column that you want to fit.
y_std=pd.read_csv('my_df.csv')
y_std=y_std['column_A']

#-------------------------------------------------
chi_square_statistics = []
size=len(y_std)

# 20 percentile cut-offs, which define 19 equal-frequency bins of the observed data
percentile_bins = np.linspace(0,100,20)
percentile_cutoffs = np.percentile(y_std, percentile_bins)
observed_frequency, bins = (np.histogram(y_std, bins=percentile_cutoffs))
cum_observed_frequency = np.cumsum(observed_frequency)

# Loop through candidate distributions
for distribution in dist_names:
    # Set up distribution and get fitted distribution parameters
    dist = getattr(scipy.stats, distribution)
    param = dist.fit(y_std)
    print("{}\n{}\n".format(dist, param))

    # Get expected counts in percentile bins
    # cdf of fitted distribution across bins
    cdf_fitted = dist.cdf(percentile_cutoffs, *param)
    expected_frequency = []
    for i in range(len(percentile_bins)-1):
        expected_cdf_area = cdf_fitted[i+1] - cdf_fitted[i]
        expected_frequency.append(expected_cdf_area)

    # Chi-square-like statistic, computed on cumulative counts
    expected_frequency = np.array(expected_frequency) * size
    cum_expected_frequency = np.cumsum(expected_frequency)
    ss = sum(((cum_expected_frequency - cum_observed_frequency) ** 2) / cum_observed_frequency)
    chi_square_statistics.append(ss)


#Sort by minimum chi-square statistics
results = pd.DataFrame()
results['Distribution'] = dist_names
results['chi_square'] = chi_square_statistics
results.sort_values(['chi_square'], inplace=True)


print ('\nDistributions listed by goodness of fit:')
print ('............................................')
print (results)
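Note that the statistic above is computed on cumulative counts, so it is not the textbook Pearson chi-square. For comparison, here is a minimal sketch of the per-bin version, sum((O - E)^2 / E), using the same percentile binning idea. The function name `pearson_chi_square`, the bin count, and the synthetic data are assumptions made for illustration:

```python
import numpy as np
import scipy.stats as st


def pearson_chi_square(sample, dist_name, n_bins=10):
    """Pearson chi-square statistic of a fitted distribution on percentile bins."""
    dist = getattr(st, dist_name)
    params = dist.fit(sample)

    # Equal-frequency bin edges taken from the observed data.
    edges = np.percentile(sample, np.linspace(0, 100, n_bins + 1))
    observed, _ = np.histogram(sample, bins=edges)

    # Expected counts: fitted probability mass in each bin times the sample size.
    expected = np.diff(dist.cdf(edges, *params)) * len(sample)
    # Rescale so expected and observed totals match (the fitted tails
    # outside [min, max] are not covered by any bin).
    expected = expected * observed.sum() / expected.sum()

    return np.sum((observed - expected) ** 2 / expected)


# Synthetic, roughly normal sample (assumption).
rng = np.random.default_rng(3)
sample = rng.normal(loc=100.0, scale=15.0, size=1000)

chi_norm = pearson_chi_square(sample, "norm")
chi_expon = pearson_chi_square(sample, "expon")
print("norm :", chi_norm)
print("expon:", chi_expon)
```

A smaller value means the fitted distribution puts probability mass in the bins closer to where the data actually fall; as with the code above, it is meant for ranking candidates rather than as a formal test.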