Calculate Mode, Median and Skewness of frequency dataframe
I have a dataframe like this:
Category Frequency
1 30000
2 45000
3 32400
4 42200
5 56300
6 98200
How can I calculate the mean, median and skewness of the categories?
I tried the following:
df['cum_freq'] = [df["Category"]]*df["Frequnecy"]
mean = df['cum_freq'].mean()
median = df['cum_freq'].median()
skew = df['cum_freq'].skew()
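If the total frequency is small enough to fit in memory, use repeat to generate the data; you can then simply call these methods on the result.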
s = df['Category'].repeat(df['Frequency']).reset_index(drop=True)
print(s.mean(), s.var(ddof=1), s.skew(), s.kurtosis())
# 4.13252219664584 3.045585008424625 -0.4512924988072343 -1.1923306818513022
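The question also asks for the median, which the line above does not print. Here is a minimal sketch, assuming the same df as in the question (rebuilt here so the snippet runs on its own). It takes the median and mode from the expanded series and then reads the same values off the cumulative counts without expanding anything. The names n, cum, lo, hi and cats are my own and are not from the original answer.

```python
import pandas as pd

# Frequency table from the question, rebuilt so the snippet is self-contained.
df = pd.DataFrame({
    "Category": [1, 2, 3, 4, 5, 6],
    "Frequency": [30000, 45000, 32400, 42200, 56300, 98200],
})

# Median and mode from the expanded data (holds sum(Frequency) values in memory).
s = df["Category"].repeat(df["Frequency"]).reset_index(drop=True)
median_expanded = s.median()
mode_expanded = s.mode().iloc[0]

# Same quantities straight from the counts, assuming rows are sorted by Category.
n = int(df["Frequency"].sum())
cum = df["Frequency"].cumsum().to_numpy()
cats = df["Category"].to_numpy()
# 0-based positions of the middle observation(s) in the sorted expanded data.
lo, hi = (n - 1) // 2, n // 2
median_counts = (cats[cum.searchsorted(lo, side="right")]
                 + cats[cum.searchsorted(hi, side="right")]) / 2
# The mode is simply the category with the largest count.
mode_counts = df.loc[df["Frequency"].idxmax(), "Category"]

print(median_expanded, median_counts, mode_expanded, mode_counts)
```

For this table both routes give a median of 5 and a mode of 6.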
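Otherwise, you will need more involved algebra to calculate the moments, which can be done with the k-statistics. Some of the lower moments can be computed with other libraries such as numpy or statsmodels, but for quantities like skewness and kurtosis this is done manually from the sums of the de-meaned values (calculated from the counts). Since these sums will overflow numpy, we need to use plain Python.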
def moments_from_counts(vals, weights):
    """
    Returns tuple (mean, N-1 variance, skewness, kurtosis) from count data
    """
    vals = [float(x) for x in vals]
    weights = [float(x) for x in weights]
    n = sum(weights)
    mu = sum([x*y for x,y in zip(vals,weights)])/n
    S1 = sum([(x-mu)**1*y for x,y in zip(vals,weights)])
    S2 = sum([(x-mu)**2*y for x,y in zip(vals,weights)])
    S3 = sum([(x-mu)**3*y for x,y in zip(vals,weights)])
    S4 = sum([(x-mu)**4*y for x,y in zip(vals,weights)])
    k1 = S1/n
    k2 = (n*S2-S1**2)/(n*(n-1))
    k3 = (2*S1**3 - 3*n*S1*S2 + n**2*S3)/(n*(n-1)*(n-2))
    k4 = (-6*S1**4 + 12*n*S1**2*S2 - 3*n*(n-1)*S2**2 -4*n*(n+1)*S1*S3 + n**2*(n+1)*S4)/(n*(n-1)*(n-2)*(n-3))
    return mu, k2, k3/k2**1.5, k4/k2**2
moments_from_counts(df['Category'], df['Frequency'])
#(4.13252219664584, 3.045585008418879, -0.4512924988072345, -1.1923306818513018)
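Because mu is the weighted mean itself, S1 is zero up to rounding, so every term containing S1 drops out of the k-statistics. Below is a minimal sketch of that simplified form, checked against pandas on a small made-up table. The function name moments_from_counts_simplified and the toy data are my own and are not part of the original answer.

```python
import math
import pandas as pd

def moments_from_counts_simplified(vals, weights):
    """Mean, N-1 variance, skewness and excess kurtosis from count data,
    using the k-statistics with the S1 terms dropped (S1 == 0 around the mean)."""
    vals = [float(x) for x in vals]
    weights = [float(w) for w in weights]
    n = sum(weights)
    mu = sum(x * w for x, w in zip(vals, weights)) / n
    # Weighted sums of powers of the de-meaned values.
    S2 = sum((x - mu) ** 2 * w for x, w in zip(vals, weights))
    S3 = sum((x - mu) ** 3 * w for x, w in zip(vals, weights))
    S4 = sum((x - mu) ** 4 * w for x, w in zip(vals, weights))
    k2 = S2 / (n - 1)
    k3 = n * S3 / ((n - 1) * (n - 2))
    k4 = (n * (n + 1) * S4 - 3 * (n - 1) * S2 ** 2) / ((n - 1) * (n - 2) * (n - 3))
    return mu, k2, k3 / k2 ** 1.5, k4 / k2 ** 2

# Small made-up table so the expanded series is cheap to build.
toy = pd.DataFrame({"Category": [1, 2, 3, 4], "Frequency": [3, 1, 4, 2]})
expanded = toy["Category"].repeat(toy["Frequency"]).reset_index(drop=True)

from_counts = moments_from_counts_simplified(toy["Category"], toy["Frequency"])
from_pandas = (expanded.mean(), expanded.var(ddof=1),
               expanded.skew(), expanded.kurtosis())

# The two routes should agree up to floating-point rounding.
agree = all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
            for a, b in zip(from_counts, from_pandas))
print(agree)
```

Comparing on a tiny table like this is a cheap way to confirm the formulas before trusting them on counts too large to expand.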
statsmodels has a nice class that takes care of the lower moments, as well as the quantiles.
from statsmodels.stats.weightstats import DescrStatsW
d = DescrStatsW(df['Category'], weights=df['Frequency'])
d.mean
#4.13252219664584
d.var_ddof(1)
#3.045585008418879
If you call d.asrepeats(), the DescrStatsW class also gives you access to the underlying data as an array.
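Since the question asks for the median, here is a short sketch of the quantile side of DescrStatsW. It assumes statsmodels is installed and rebuilds the question's table as NumPy arrays (categories and frequencies are my own names). To my knowledge, quantile takes a probability or list of probabilities plus a return_pandas flag, but check the statsmodels documentation for your version.

```python
import numpy as np
from statsmodels.stats.weightstats import DescrStatsW

# The question's table as plain arrays (names are illustrative).
categories = np.array([1, 2, 3, 4, 5, 6])
frequencies = np.array([30000, 45000, 32400, 42200, 56300, 98200])

d = DescrStatsW(categories, weights=frequencies)

# Weighted quantiles; the entry for 0.5 is the median.
quartiles = d.quantile([0.25, 0.5, 0.75], return_pandas=False)
median = quartiles[1]

# asrepeats() expands the counts into one entry per observation.
expanded = d.asrepeats()

print(d.mean, median, len(expanded))
```

Note that asrepeats() materialises the full data, so it has the same memory cost as the repeat approach.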