How to interpret the z-score of a column to find the distribution type?
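I have a pandas DataFrame with several columns.
I calculated the z-score of one column from its mean and standard deviation.
Now I would like to know which distribution the data follows, based on the z-score. From the histogram I can see that it looks normally distributed.
Is there a procedure to determine the distribution type from the z-score?
I am new to statistics, so I may be missing something very simple.
Sample code: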
df[col_zscore] = (df[column] - df[column].mean())/df[column].std(ddof=0)
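If the distribution is normal, then by the 68–95–99.7 rule about 68% of df[col_zscore] lies between -1 and 1, about 95% between -2 and 2, and about 99.7% between -3 and 3. For a constant column, on the other hand, the standard deviation is 0, so every z-score is 0/0, which pandas evaluates to NaN rather than infinity.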
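The 68, 95 and 99.7 figures are rounded. As a small sketch of where they come from, the share of a normal distribution within k standard deviations of the mean is erf(k / sqrt(2)); the helper name normal_within below is only an illustrative choice, not part of the original answer.

```python
import math

def normal_within(k: float) -> float:
    """Percentage of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2)) * 100

for k in (1, 2, 3):
    # Prints 68.27, 95.45 and 99.73 for k = 1, 2, 3.
    print(f"within {k} sigma: {normal_within(k):.2f}%")
```

These exact values are what the empirical percentages of a z-scored column should approach if the column is normal.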
You can check whether a column is close to normal or is a constant value with the following function:
import math
import numpy as np

def three_sigma_rule(zscores):
    """Print the share of z-scores within 1, 2 and 3 sigma and a rough verdict."""
    values = zscores.tolist()
    n = len(values)
    # NaN fails every comparison, so NaN entries count as outside each band.
    one_sigma = len([v for v in values if -1 < v < 1]) / n * 100
    two_sigma = len([v for v in values if -2 < v < 2]) / n * 100
    three_sigma = len([v for v in values if -3 < v < 3]) / n * 100
    print("Percentage of the z-score between -1 to 1: {0}%".format(one_sigma))
    print("Percentage of the z-score between -2 to 2: {0}%".format(two_sigma))
    print("Percentage of the z-score between -3 to 3: {0}%".format(three_sigma))
    # rel_tol=0.1 accepts a difference of up to 10% of the larger of the two values.
    condition1 = math.isclose(one_sigma, 68, rel_tol=0.1)
    condition2 = math.isclose(two_sigma, 95, rel_tol=0.1)
    condition3 = math.isclose(three_sigma, 99.7, rel_tol=0.1)
    # A constant column has zero standard deviation, so all of its z-scores are NaN.
    condition4 = np.isnan(values).all()
    if condition1 and condition2 and condition3:
        print("It is a normal distribution.")
    if condition4:
        print("It is a fixed value.")
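The same three percentages can also be computed without Python-level loops. The following is a minimal sketch, assuming the z-scores are held in a pandas Series; the name sigma_fractions is hypothetical, and unlike the function above it drops NaN values before counting instead of treating them as outside every band.

```python
import numpy as np
import pandas as pd

def sigma_fractions(z: pd.Series) -> dict:
    """Return the percentage of non-NaN z-scores with |z| < 1, 2 and 3."""
    z = z.dropna()
    if z.empty:
        # Happens for a constant column, where every z-score is NaN.
        return {k: float("nan") for k in (1, 2, 3)}
    # abs().lt(k) gives a boolean Series; its mean is the fraction of True values.
    return {k: float(z.abs().lt(k).mean() * 100) for k in (1, 2, 3)}

example = pd.Series([-2.5, -0.5, 0.0, 0.5, 1.5, np.nan])
print(sigma_fractions(example))
```

Returning the numbers rather than printing them makes it easier to apply the check to many columns and compare the results.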
Let's generate some random numbers:
if __name__ == "__main__":
    import pandas as pd
    import numpy as np
    n = 100000
    df = pd.DataFrame(dict(
        a=np.random.normal(5, 3, size=n),                   # normal, mean 5, std 3
        b=np.random.uniform(low=-100, high=10000, size=n),  # uniform
        c=np.random.uniform(low=5, high=5, size=n),         # constant: every value is 5
    ))
    df['a_zscore'] = (df['a'] - df['a'].mean())/df['a'].std(ddof=0)
    df['b_zscore'] = (df['b'] - df['b'].mean())/df['b'].std(ddof=0)
    df['c_zscore'] = (df['c'] - df['c'].mean())/df['c'].std(ddof=0)
Output of three_sigma_rule(df['a_zscore']):
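The exact percentages change from run to run because the columns are random. As a rough guide to what to expect, the sketch below repeats the experiment with a seeded generator for a normal and a uniform column. The seed and the helper name within are assumptions made for illustration. For a uniform column the z-scores span only about -1.73 to 1.73 (the half-width is sqrt(3) standard deviations), so roughly 57.7% fall between -1 and 1 and all of them fall between -2 and 2, which is why the first condition fails for column b.

```python
import numpy as np

def within(values: np.ndarray, k: float) -> float:
    """Percentage of values whose z-score lies strictly between -k and k."""
    # np.std defaults to ddof=0, matching std(ddof=0) in the sample code.
    z = (values - values.mean()) / values.std()
    return float(np.mean(np.abs(z) < k) * 100)

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability
n = 100_000
normal = rng.normal(5, 3, size=n)
uniform = rng.uniform(-100, 10_000, size=n)

for name, col in (("normal", normal), ("uniform", uniform)):
    print(name, [round(within(col, k), 1) for k in (1, 2, 3)])
```

Passing the three-sigma check only shows that a column is consistent with a normal distribution at these three points; it does not identify the distribution type in general.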