How to interpret the z-score of a column to find the distribution type?

Question

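I have a pandas DataFrame with several columns.

I calculated z-scores for one of the columns using its mean and standard deviation.

Now I would like to know what distribution the data follows, based on the z-scores. From the histogram I can tell that it is a normal distribution.

Is there a procedure for determining the distribution type from the z-scores?

I am new to statistics, so I may be missing something very simple.

Sample code: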

df[col_zscore] = (df[column] - df[column].mean())/df[column].std(ddof=0)

Answer 1

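If the distribution is normal, then by the 68–95–99.7 rule about 68% of df[col_zscore] lies between -1 and 1, about 95% between -2 and 2, and about 99.7% between -3 and 3. For a constant column, on the other hand, the standard deviation is zero, so every z-score is 0/0, which pandas evaluates to NaN rather than infinity.

You can check whether a column is close to normal, or constant, with the following function: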
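To see where the 68, 95 and 99.7 figures come from, here is a minimal sketch using only the standard library. It relies on the identity P(|Z| < k) = erf(k / sqrt(2)) for a standard normal variable. The function names normal_coverage and uniform_coverage are made up for this illustration; the uniform case is included only as a contrast, since a standardized uniform variable spans roughly -1.73 to 1.73.

```python
import math

def normal_coverage(k):
    """Probability that a standard normal variable falls in (-k, k)."""
    return math.erf(k / math.sqrt(2))

def uniform_coverage(k):
    """Same probability for a standardized uniform variable.

    A uniform variable has std = range / sqrt(12), so its z-scores are
    spread evenly over (-sqrt(3), sqrt(3)).
    """
    return min(1.0, k / math.sqrt(3))

for k in (1, 2, 3):
    print(f"within {k} sigma: normal {normal_coverage(k) * 100:.2f}%, "
          f"uniform {uniform_coverage(k) * 100:.2f}%")
```

The normal percentages match the rule, while a uniform column keeps only about 58% of its z-scores within one sigma, which is why the band counts can tell the two apart.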

import math
import numpy as np

def three_sigma_rule(zscores):
  # Work on a float array; avoids shadowing the built-in name `input`.
  values = np.asarray(zscores, dtype=float)
  # Percentage of z-scores strictly inside each band (NaN counts as outside).
  one_sigma = (np.abs(values) < 1).mean() * 100
  two_sigma = (np.abs(values) < 2).mean() * 100
  three_sigma = (np.abs(values) < 3).mean() * 100
  print("Percentage of the z-score between -1 to 1: {0}%".format(one_sigma))
  print("Percentage of the z-score between -2 to 2: {0}%".format(two_sigma))
  print("Percentage of the z-score between -3 to 3: {0}%".format(three_sigma))
  # rel_tol=0.1 accepts a 10% relative deviation: a rough heuristic, not a formal test.
  condition1 = math.isclose(one_sigma, 68, rel_tol=0.1)
  condition2 = math.isclose(two_sigma, 95, rel_tol=0.1)
  condition3 = math.isclose(three_sigma, 99.7, rel_tol=0.1)
  # A constant column has zero standard deviation, so every z-score is NaN.
  condition4 = np.isnan(values).all()
  if condition1 and condition2 and condition3:
    print("It is approximately a normal distribution.")
  if condition4:
    print("It is a fixed value.")
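As a complementary check, one could also look at the third and fourth moments of the z-scores. Because z-scores computed with the population standard deviation have mean 0 and variance 1, the skewness is simply the mean of z cubed and the excess kurtosis is the mean of z to the fourth power minus 3; both are close to 0 for normal data. The sketch below uses only the standard library, and the helper names zscores and skew_and_excess_kurtosis are hypothetical, not part of the answer above. It assumes a non-constant sample (a constant one would divide by zero).

```python
import math
import random

def zscores(values):
    """Standardize with the population std (the ddof=0 convention)."""
    n = len(values)
    mean = math.fsum(values) / n
    std = math.sqrt(math.fsum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

def skew_and_excess_kurtosis(z):
    """Third and fourth moments of already standardized data."""
    n = len(z)
    skew = math.fsum(x ** 3 for x in z) / n
    excess_kurtosis = math.fsum(x ** 4 for x in z) / n - 3.0
    return skew, excess_kurtosis

rng = random.Random(0)  # fixed seed so the run is repeatable
normal_sample = [rng.gauss(5, 3) for _ in range(20000)]
uniform_sample = [rng.uniform(-100, 10000) for _ in range(20000)]

print("normal :", skew_and_excess_kurtosis(zscores(normal_sample)))
print("uniform:", skew_and_excess_kurtosis(zscores(uniform_sample)))
```

For the normal sample both numbers should be near 0, while a uniform sample has an excess kurtosis near -1.2. Like the band counts, this is a heuristic rather than a formal hypothesis test.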

Let's generate some random numbers:

if __name__ == "__main__":
  import pandas as pd
  import numpy as np

  n = 100000
  df = pd.DataFrame(dict(
    a=np.random.normal(5,3,size=n),                     # normal, mean 5, std 3
    b=np.random.uniform(low=-100, high=10000, size=n),  # uniform
    c=np.random.uniform(low=5, high=5, size=n),         # constant: every value is 5
  ))
  # ddof=0 uses the population standard deviation
  df['a_zscore'] = (df['a'] - df['a'].mean())/df['a'].std(ddof=0)
  df['b_zscore'] = (df['b'] - df['b'].mean())/df['b'].std(ddof=0)
  # std is 0 here, so every value becomes 0/0 = NaN
  df['c_zscore'] = (df['c'] - df['c'].mean())/df['c'].std(ddof=0)

Output of three_sigma_rule(df['a_zscore']):
