stats.ttest_ind() vs. "manual" computation of Student's independent t-test: different results
I am comparing stats.ttest_ind() against a "manual" computation of the same test and get different results.
import numpy as np
import pandas as pd
import scipy.stats as stats
import math
The stats.ttest_ind() approach:
#generate data
np.random.seed(123)
df = pd.DataFrame({
'age':np.random.normal(40,5,200).round(),
'sex':np.random.choice( ['male', 'female'], 200, p=[0.4, 0.6]),
})
#define groups
men = df.age[df.sex == 'male']
women = df.age[df.sex == 'female']
#run t-test
test_stat, test_p = stats.ttest_ind(men, women)
print(test_stat, test_p)
Output:
-0.9265613940505325 0.355282312357339
The manual approach:
#mean
men_mean, women_mean = men.mean(), women.mean()
#standard deviation
men_sd, women_sd = men.std(ddof=1), women.std(ddof=1)
#standard error
men_n, women_n = len(men), len(women)
men_se, women_se = men_sd/math.sqrt(men_n), women_sd/math.sqrt(women_n)
#standard error on the difference between men and women
se_diff = math.sqrt(men_se**2.0 + women_se**2.0)
#t-stat
t_stat = (men_mean - women_mean) / se_diff
#degrees of freedom
df = men_n + women_n - 2
#critical value
alpha = 0.05
cv = stats.t.ppf(1.0 - alpha, df)
# p-value
p = (1 - stats.t.cdf(abs(t_stat), df)) * 2
print(t_stat, cv, p)
Output:
-0.9244538916746341 0.3563753194455255
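We can see there is a small difference. Why? Perhaps it is because of how stats.ttest_ind() calculates the degrees of freedom? Any insight is much appreciated.
The following works. It is your code from above, with only two lines changed.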
import numpy as np
import pandas as pd
import scipy.stats as stats
import math
#generate data
np.random.seed(123)
df = pd.DataFrame({
'age':np.random.normal(40,5,200).round(),
'sex':np.random.choice( ['male', 'female'], 200, p=[0.4, 0.6]),
})
#define groups
men = df.age[df.sex == 'male']
women = df.age[df.sex == 'female']
#run t-test
############################### CHANGED THE ROW BELOW HERE
test_stat, test_p = stats.ttest_ind(men, women,equal_var=False)
print(test_stat, test_p)
#mean
men_mean, women_mean = men.mean(), women.mean()
#standard deviation
men_sd, women_sd = men.std(ddof=1), women.std(ddof=1)
#standard error
men_n, women_n = len(men), len(women)
men_se, women_se = men_sd/math.sqrt(men_n), women_sd/math.sqrt(women_n)
#standard error on the difference between men and women
se_diff = math.sqrt(men_se**2.0 + women_se**2.0)
#t-stat
t_stat = (men_mean - women_mean) / se_diff
#degrees of freedom
############################### CHANGED THE ROW BELOW HERE
df = (men_sd**2/men_n + women_sd**2/women_n)**2 / ( men_sd**4/men_n**2/(men_n-1) + women_sd**4/women_n**2/(women_n-1) )
#critical value
alpha = 0.05
cv = stats.t.ppf(1.0 - alpha, df)
# p-value
p = (1 - stats.t.cdf(abs(t_stat), df)) * 2
print(t_stat, cv, p)
It outputs:
-0.9244538916746341 0.356441636045986
-0.9244538916746341 1.6530443278019797 0.3564416360459859
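The reason your two results disagree is:
On the line test_stat, test_p = stats.ttest_ind(men, women) you accepted the default setting, under which the t-test is computed with the equal-variance assumption. So the calculation scipy.stats gave you is a pure equal-variance t-test. This is described in the documentation of scipy.stats.ttest_ind.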
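As an illustration of that default, here is a minimal sketch of a pooled-variance (equal-variance) test computed by hand and compared with stats.ttest_ind(). The helper name pooled_t_test and the sample data are hypothetical and are not part of the original question; only the pooled standard error and the n1 + n2 - 2 degrees of freedom are the point.

```python
import math

import numpy as np
from scipy import stats


def pooled_t_test(a, b):
    """Student's t-test assuming equal variances (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    dof = n_a + n_b - 2
    # Pooled variance: the two sample variances weighted by their degrees of freedom.
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / dof
    # Standard error of the difference in means under the equal-variance assumption.
    se_diff = math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    t_stat = (a.mean() - b.mean()) / se_diff
    # Two-sided p-value from the t distribution with n_a + n_b - 2 degrees of freedom.
    p_value = 2.0 * stats.t.sf(abs(t_stat), dof)
    return t_stat, p_value


# Illustrative data only (not the data from the question).
rng = np.random.default_rng(0)
group_a = rng.normal(40, 5, 80)
group_b = rng.normal(41, 5, 120)

manual_t, manual_p = pooled_t_test(group_a, group_b)
scipy_t, scipy_p = stats.ttest_ind(group_a, group_b)  # equal_var=True is the default
print(manual_t, scipy_t)
print(manual_p, scipy_p)
```

Under these assumptions the manual and SciPy values should agree up to floating-point error, because both use the pooled standard error rather than the sum of the two separate squared standard errors.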
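In your own code, you largely followed the Welch test: you computed the mean estimates and their standard errors separately for men and women, and computed the t-statistic that way.
You did deviate from the Welch test in one place: the degrees-of-freedom calculation. The degrees of freedom should be approximated by the formula I put in the code (and linked to above), but you used the calculation that applies under the equal-variance assumption.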
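For reference, the two degrees-of-freedom calculations being contrasted can be written out as follows. The symbols s_1, s_2 (sample standard deviations), n_1, n_2 (sample sizes) and \nu (degrees of freedom) are notation introduced here for illustration; they correspond to men_sd, women_sd, men_n, women_n and df in the code.

```latex
% Equal-variance (pooled) test: degrees of freedom used in the question's code
\nu_{\text{pooled}} = n_1 + n_2 - 2

% Welch test: Welch-Satterthwaite approximation used in the answer's code
\nu_{\text{Welch}} \approx
\frac{\left( \dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2} \right)^2}
     {\dfrac{s_1^4}{n_1^2 \, (n_1 - 1)} + \dfrac{s_2^4}{n_2^2 \, (n_2 - 1)}}
```

The Welch value is generally not an integer, and stats.t.cdf and stats.t.ppf accept a non-integer degrees-of-freedom argument.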
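If you want more detail on how these statistics are computed, why they are defined the way they are, or why your code does not do what you expected, I suggest looking at https://stats.stackexchange.com/ and https://datascience.stackexchange.com/, which are more appropriate for statistics questions than https://whosebug.com/, which is about programming. Both communities are well versed in Python, so they should be able to help you.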