您如何计算 Python 中 Pearson r 的置信区间?
How do you compute the confidence interval for Pearson's r in Python?
在 Python 中,我知道如何使用 scipy.stats.pearsonr
计算 r 和关联的 p 值,但我无法找到计算 r 的置信区间的方法。这是怎么做到的?感谢您的帮助:)
使用 rpy2 和心理测量库(您需要安装 R 并首先在 R 中 运行 install.packages("psychometric")
from rpy2.robjects.packages import importr
psychometric=importr('psychometric')
psychometric.CIr(r=.9, n = 100, level = .95)
其中 0.9 是您的相关性,n 是样本量,0.95 是置信度
根据[1],直接用Pearson r计算置信区间是复杂的,因为它不是正态分布的。需要以下步骤:
- 将 r 转换为 z',
- 计算 z' 置信区间。 z'的采样分布近似正态分布,标准误差为1/sqrt(n-3)。
- 将置信区间转换回 r。
下面是一些示例代码:
def r_to_z(r):
return math.log((1 + r) / (1 - r)) / 2.0
def z_to_r(z):
e = math.exp(2 * z)
return((e - 1) / (e + 1))
def r_confidence_interval(r, alpha, n):
z = r_to_z(r)
se = 1.0 / math.sqrt(n - 3)
z_crit = stats.norm.ppf(1 - alpha/2) # 2-tailed z critical value
lo = z - z_crit * se
hi = z + z_crit * se
# Return a sequence
return (z_to_r(lo), z_to_r(hi))
参考:
bennylp给出的答案大部分是正确的,但是在第3个函数中计算临界值时有一个小错误。
应该改为:
def r_confidence_interval(r, alpha, n):
z = r_to_z(r)
se = 1.0 / math.sqrt(n - 3)
z_crit = stats.norm.ppf((1 + alpha)/2) # 2-tailed z critical value
lo = z - z_crit * se
hi = z + z_crit * se
# Return a sequence
return (z_to_r(lo), z_to_r(hi))
这里还有一个post供参考:
这是一个使用自举来计算置信区间的解决方案,而不是 Fisher 变换(假定双变量正态性等),借用 this answer:
import numpy as np
def pearsonr_ci(x, y, ci=95, n_boots=10000):
x = np.asarray(x)
y = np.asarray(y)
# (n_boots, n_observations) paired arrays
rand_ixs = np.random.randint(0, x.shape[0], size=(n_boots, x.shape[0]))
x_boots = x[rand_ixs]
y_boots = y[rand_ixs]
# differences from mean
x_mdiffs = x_boots - x_boots.mean(axis=1)[:, None]
y_mdiffs = y_boots - y_boots.mean(axis=1)[:, None]
# sums of squares
x_ss = np.einsum('ij, ij -> i', x_mdiffs, x_mdiffs)
y_ss = np.einsum('ij, ij -> i', y_mdiffs, y_mdiffs)
# pearson correlations
r_boots = np.einsum('ij, ij -> i', x_mdiffs, y_mdiffs) / np.sqrt(x_ss * y_ss)
# upper and lower bounds for confidence interval
ci_low = np.percentile(r_boots, (100 - ci) / 2)
ci_high = np.percentile(r_boots, (ci + 100) / 2)
return ci_low, ci_high
我知道上面已经建议了自举,下面建议它的另一种变体,它可能更适合其他一些设置。
#1
对您的数据进行采样(成对的 X 和 Y,也可以添加其他权重),在其上拟合原始模型,记录 r2,附加它。然后从记录的所有 R2 的分布中提取置信区间。
#2 此外可以拟合采样数据并使用采样数据模型预测非采样 X (也可以提供连续范围来扩展您的预测使用原始 X)
获得 Y 帽子的置信区间。
所以在示例代码中:
import numpy as np
from scipy.optimize import curve_fit
import pandas as pd
from sklearn.metrics import r2_score
x = np.array([your numbers here])
y = np.array([your numbers here])
### define list for R2 values
r2s = []
### define dataframe to append your bootstrapped fits for Y hat ranges
ci_df = pd.DataFrame({'x': x})
### define how many samples you want
how_many_straps = 5000
### define your fit function/s
def func_exponential(x,a,b):
return np.exp(b) * np.exp(a * x)
### fit original, using log because fitting exponential
polyfit_original = np.polyfit(x
,np.log(y)
,1
,# w= could supply weight for observations here)
)
for i in range(how_many_straps+1):
### zip into tuples attaching X to Y, can combine more variables as well
zipped_for_boot = pd.Series(tuple(zip(x,y)))
### sample zipped X & Y pairs above with replacement
zipped_resampled = zipped_for_boot.sample(frac=1,
replace=True)
### creater your sampled X & Y
boot_x = []
boot_y = []
for sample in zipped_resampled:
boot_x.append(sample[0])
boot_y.append(sample[1])
### predict sampled using original fit
y_hat_boot_via_original_fit = func_exponential(np.asarray(boot_x),
polyfit_original[0],
polyfit_original[1])
### calculate r2 and append
r2s.append(r2_score(boot_y, y_hat_boot_via_original_fit))
### fit sampled
polyfit_boot = np.polyfit(boot_x
,np.log(boot_y)
,1
,# w= could supply weight for observations here)
)
### predict original via sampled fit or on a range of min(x) to Z
y_hat_original_via_sampled_fit = func_exponential(x,
polyfit_boot[0],
polyfit_boot[1])
### insert y hat into dataframe for calculating y hat confidence intervals
ci_df["trial_" + str(i)] = y_hat_original_via_sampled_fit
### R2 conf interval
low = round(pd.Series(r2s).quantile([0.025, 0.975]).tolist()[0],3)
up = round(pd.Series(r2s).quantile([0.025, 0.975]).tolist()[1],3)
F"r2 confidence interval = {low} - {up}"
在 Python 中,我知道如何使用 scipy.stats.pearsonr
计算 r 和关联的 p 值,但我无法找到计算 r 的置信区间的方法。这是怎么做到的?感谢您的帮助:)
使用 rpy2 和心理测量库(您需要安装 R 并首先在 R 中 运行 install.packages("psychometric")
from rpy2.robjects.packages import importr
psychometric=importr('psychometric')
psychometric.CIr(r=.9, n = 100, level = .95)
其中 0.9 是您的相关性,n 是样本量,0.95 是置信度
根据[1],直接用Pearson r计算置信区间是复杂的,因为它不是正态分布的。需要以下步骤:
- 将 r 转换为 z',
- 计算 z' 置信区间。 z'的采样分布近似正态分布,标准误差为1/sqrt(n-3)。
- 将置信区间转换回 r。
下面是一些示例代码:
def r_to_z(r):
return math.log((1 + r) / (1 - r)) / 2.0
def z_to_r(z):
e = math.exp(2 * z)
return((e - 1) / (e + 1))
def r_confidence_interval(r, alpha, n):
z = r_to_z(r)
se = 1.0 / math.sqrt(n - 3)
z_crit = stats.norm.ppf(1 - alpha/2) # 2-tailed z critical value
lo = z - z_crit * se
hi = z + z_crit * se
# Return a sequence
return (z_to_r(lo), z_to_r(hi))
参考:
bennylp给出的答案大部分是正确的,但是在第3个函数中计算临界值时有一个小错误。
应该改为:
def r_confidence_interval(r, alpha, n):
z = r_to_z(r)
se = 1.0 / math.sqrt(n - 3)
z_crit = stats.norm.ppf((1 + alpha)/2) # 2-tailed z critical value
lo = z - z_crit * se
hi = z + z_crit * se
# Return a sequence
return (z_to_r(lo), z_to_r(hi))
这里还有一个post供参考:
这是一个使用自举来计算置信区间的解决方案,而不是 Fisher 变换(假定双变量正态性等),借用 this answer:
import numpy as np
def pearsonr_ci(x, y, ci=95, n_boots=10000):
x = np.asarray(x)
y = np.asarray(y)
# (n_boots, n_observations) paired arrays
rand_ixs = np.random.randint(0, x.shape[0], size=(n_boots, x.shape[0]))
x_boots = x[rand_ixs]
y_boots = y[rand_ixs]
# differences from mean
x_mdiffs = x_boots - x_boots.mean(axis=1)[:, None]
y_mdiffs = y_boots - y_boots.mean(axis=1)[:, None]
# sums of squares
x_ss = np.einsum('ij, ij -> i', x_mdiffs, x_mdiffs)
y_ss = np.einsum('ij, ij -> i', y_mdiffs, y_mdiffs)
# pearson correlations
r_boots = np.einsum('ij, ij -> i', x_mdiffs, y_mdiffs) / np.sqrt(x_ss * y_ss)
# upper and lower bounds for confidence interval
ci_low = np.percentile(r_boots, (100 - ci) / 2)
ci_high = np.percentile(r_boots, (ci + 100) / 2)
return ci_low, ci_high
我知道上面已经建议了自举,下面建议它的另一种变体,它可能更适合其他一些设置。
#1 对您的数据进行采样(成对的 X 和 Y,也可以添加其他权重),在其上拟合原始模型,记录 r2,附加它。然后从记录的所有 R2 的分布中提取置信区间。
#2 此外可以拟合采样数据并使用采样数据模型预测非采样 X (也可以提供连续范围来扩展您的预测使用原始 X) 获得 Y 帽子的置信区间。
所以在示例代码中:
import numpy as np
from scipy.optimize import curve_fit
import pandas as pd
from sklearn.metrics import r2_score
x = np.array([your numbers here])
y = np.array([your numbers here])
### define list for R2 values
r2s = []
### define dataframe to append your bootstrapped fits for Y hat ranges
ci_df = pd.DataFrame({'x': x})
### define how many samples you want
how_many_straps = 5000
### define your fit function/s
def func_exponential(x,a,b):
return np.exp(b) * np.exp(a * x)
### fit original, using log because fitting exponential
polyfit_original = np.polyfit(x
,np.log(y)
,1
,# w= could supply weight for observations here)
)
for i in range(how_many_straps+1):
### zip into tuples attaching X to Y, can combine more variables as well
zipped_for_boot = pd.Series(tuple(zip(x,y)))
### sample zipped X & Y pairs above with replacement
zipped_resampled = zipped_for_boot.sample(frac=1,
replace=True)
### creater your sampled X & Y
boot_x = []
boot_y = []
for sample in zipped_resampled:
boot_x.append(sample[0])
boot_y.append(sample[1])
### predict sampled using original fit
y_hat_boot_via_original_fit = func_exponential(np.asarray(boot_x),
polyfit_original[0],
polyfit_original[1])
### calculate r2 and append
r2s.append(r2_score(boot_y, y_hat_boot_via_original_fit))
### fit sampled
polyfit_boot = np.polyfit(boot_x
,np.log(boot_y)
,1
,# w= could supply weight for observations here)
)
### predict original via sampled fit or on a range of min(x) to Z
y_hat_original_via_sampled_fit = func_exponential(x,
polyfit_boot[0],
polyfit_boot[1])
### insert y hat into dataframe for calculating y hat confidence intervals
ci_df["trial_" + str(i)] = y_hat_original_via_sampled_fit
### R2 conf interval
low = round(pd.Series(r2s).quantile([0.025, 0.975]).tolist()[0],3)
up = round(pd.Series(r2s).quantile([0.025, 0.975]).tolist()[1],3)
F"r2 confidence interval = {low} - {up}"