How to calculate a correlation with p-values most performant in Python?
I want to compute the correlation of my data together with its p-values. Currently I use Pandas and its corr method on my DataFrame. The problem is that this correlation method doesn't provide p-values.
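For reference, what I'm doing today is roughly this (a minimal sketch; the real data loading is omitted):

import pandas as pd

# df is my 500 x 550 DataFrame, containing None values
corr = df.corr()  # pairwise Pearson correlations, but no p-values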
So I tried the two answers to this question: Stack Overflow Question. Both solutions use the scipy.stats.pearsonr method for the calculation. I can't use the first one (Solution 1) because it removes too much of my dataset. My next try was the second one (Solution 2). It gives me the results I want, but it takes a huge amount of time.
For comparison: my pandas-only correlation takes about 4 seconds from creating the DataFrame to computing the correlation. Solution 2 needs about 6 minutes to return a result. My guess is that the repeated creation of new DataFrames requires a lot of computation, and that time adds up over my dataset.
Is there a more performant way to calculate this result? Pandas corr must also be doing this in the background to handle my None values, so there has to be a better solution.
My test dataset has 500 rows with 550 values each. As I said, there are also None values.
Solving your problem requires both math and programming. Since df.corr returns quickly for you, I will focus on the p-value:
Programming
scipy.stats.pearsonr(col_x, col_y) doesn't like dealing with NaNs. So for each pair of columns, you have to drop all rows where either or both elements are NaN. With 550 columns, there are 550 * 549 / 2 = 150,975 pairs, so you'd better make sure your loop is very efficient.
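For reference, here is roughly what such a per-pair loop looks like (my sketch, not the code from the linked question). The Python-level loop and the per-pair dropna are what make this approach slow:

import scipy.stats

def corr_naive(df):
    # For every pair of columns, drop rows with a NaN in either column,
    # then call pearsonr. This pays Python and pandas overhead on every
    # one of the ~150,975 pairs.
    results = {}
    cols = list(df.columns)
    for i, a in enumerate(cols):
        for b in cols[:i]:
            pair = df[[a, b]].dropna()
            r, p = scipy.stats.pearsonr(pair[a], pair[b])
            results[(a, b)] = (r, p)
    return results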
If you look at its source code, you'll see that DataFrame.corr is so fast for two reasons:
- It's coded in Cython and runs outside the Global Interpreter Lock (GIL). That means the loop runs in bare-metal C and is therefore very fast.
- It implements its own variance algorithm (Welford's method; see the sketch after this list) and doesn't rely on scipy.stats. The complexity of this algorithm is O(n * m^2), where n is the number of rows and m is the number of columns.
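To make the second point concrete, here is a minimal pure-Python sketch of Welford's method for the variance of a single column (the function name welford_var is mine; pandas implements the pairwise analogue in Cython):

import math

def welford_var(values):
    # Welford's method: a single, numerically stable pass over the data
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        if x is None or math.isnan(x):
            continue  # skip missing values, as pandas does
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return m2 / (n - 1) if n > 1 else float('nan')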
Math
The pearsonr documentation gives instructions on how the p-value is calculated:
r = <Pearson correlation coefficient>
dist = scipy.stats.beta(n/2 - 1, n/2 - 1, loc=-1, scale=2)
p = 2 * dist.cdf(-abs(r))
Wikipedia gives us the formula for the CDF of the Beta distribution:
cdf(x, alpha, beta) = B(x, alpha, beta) / B(alpha, beta)
= scipy.special.betainc(alpha, beta, x)
Luckily for us, the betainc function is vectorized, so if we pass in 3 arrays of the same length as arguments, it will return an array as output.
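As a quick sanity check (my addition, not part of the original answer), you can confirm on a single pair of columns that the betainc formulation reproduces the p-value from scipy.stats.pearsonr:

import numpy as np
import scipy.special
import scipy.stats

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = x + rng.normal(size=100)

r, p_ref = scipy.stats.pearsonr(x, y)

n = len(x)
alpha = n / 2 - 1
t = (r + 1) / 2            # map r from [-1, 1] to [0, 1]
t = t if r < 0 else 1 - t  # evaluate the CDF at -abs(r)
p = 2 * scipy.special.betainc(alpha, alpha, t)
print(np.isclose(p, p_ref))  # True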
Solution 1
This solution is in pure Python and delivers reasonable performance on your dataset (500 * 550). It took about 30 seconds on a 2014 iMac with 16GB of RAM:
import numpy as np
import pandas as pd
import scipy.special

def corr1(df):
    mask = df.notna().to_numpy()
    corr = df.corr().to_numpy()
    n_rows, n_cols = df.shape

    # Preallocate the return arrays for better performance
    length = int(n_cols * (n_cols - 1) / 2)
    idx = np.empty((length, 2), dtype=object)
    correl = np.empty(length, dtype=np.float64)
    count = np.empty(length, dtype=np.uint64)

    # For each 2-combination of columns, let `n` be the number of rows
    # where both elements are non-NaN. We will need that later to
    # calculate the p-value
    k = -1
    for i in range(n_cols):
        for j in range(i):
            n = 0
            for row in range(n_rows):
                n += 1 if mask[row, i] and mask[row, j] else 0
            k += 1
            idx[k] = (i, j)
            correl[k] = corr[i, j]
            count[k] = n

    # The p-value can be obtained with the incomplete Beta function (betainc).
    # We just need to massage the input a little
    alpha = count / 2 - 1
    x = (correl + 1) / 2
    x = np.where(correl < 0, x, 1 - x)
    p = 2 * scipy.special.betainc(alpha, alpha, x)
    return idx, correl, p

# Manipulate the return values into the right format
index, corr, p = corr1(df)
idx = pd.MultiIndex.from_tuples(
    [(df.columns[i], df.columns[j]) for i, j in index] +
    [(df.columns[j], df.columns[i]) for i, j in index]
)
full_index = pd.MultiIndex.from_product([df.columns, df.columns])
result = pd.DataFrame({
    'corr': np.tile(corr, 2),
    'p': np.tile(p, 2)
}, index=idx).reindex(full_index).unstack()
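To try it out end to end, here is a small example with made-up data and column names; note it assumes at least a couple of non-NaN rows per column pair:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
data = rng.normal(size=(500, 5))
data[rng.random(data.shape) < 0.1] = np.nan  # sprinkle in missing values
df = pd.DataFrame(data, columns=list('abcde'))

index, corr, p = corr1(df)
print(index[0], corr[0], p[0])  # first column pair, its r and its p-value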
Solution 2
For the absolute fastest solution, you have to write it in Cython. That reduces the execution time from 30 seconds to 5 seconds. I'm sure further optimizations are possible (but I was too lazy to explore them). The trade-off is a more complicated build and deployment process.
First, make sure you have a C compiler. Then install the Cython package:
pip install cython
Next, create 2 files: setup.py and utility.pyx:
#################################################
# setup.py
#################################################
from distutils.core import setup
from Cython.Build import cythonize
import numpy

compiler_directives = {
    'language_level': '3'
}
setup(
    ext_modules=cythonize("utility.pyx", compiler_directives=compiler_directives),
    include_dirs=[numpy.get_include()]
)
#################################################
# utility.pyx
#################################################
import cython
from cython import Py_ssize_t

import numpy as np
from numpy cimport (
    ndarray,
    float64_t,
    uint8_t,
    uint64_t,
)
import scipy.special

@cython.boundscheck(False)
@cython.wraparound(False)
def corr2(object df):
    # These variables go into the `nogil` context (i.e. into C land) so they
    # must be statically typed
    cdef:
        Py_ssize_t n_rows, n_cols, i, j, row, n, k
        ndarray[uint8_t, ndim=2] mask
        ndarray[float64_t, ndim=2] corr
        # the return arrays, filled inside the nogil block
        ndarray[uint64_t, ndim=2] idx
        ndarray[float64_t, ndim=1] correl
        ndarray[uint64_t, ndim=1] count

    # We are still in Python land and thus have full access to all functions
    # in numpy and pandas. Converting the pandas DataFrame to a 2D numpy
    # array gives a huge speed boost
    mask = df.notna().to_numpy(dtype='uint8')
    corr = df.corr().to_numpy()
    n_rows, n_cols = df.shape

    # Preallocate the return arrays
    length = int(n_cols * (n_cols - 1) / 2)
    idx = np.empty((length, 2), dtype=np.uint64)
    correl = np.empty(length, dtype=np.float64)
    count = np.empty(length, dtype=np.uint64)

    with nogil:
        # It's C-land in here. Everything must be statically defined
        k = -1
        # For each 2-combination of columns, let `n` be the number of rows
        # where both elements are non-NaN. We will need that later to
        # calculate the p-value
        for i in range(n_cols):
            for j in range(i):
                n = 0
                for row in range(n_rows):
                    n += 1 if mask[row, i] and mask[row, j] else 0
                k += 1
                idx[k, 0] = i
                idx[k, 1] = j
                correl[k] = corr[i, j]
                count[k] = n

    # Back to Python-land.
    # The p-value can be obtained with the incomplete Beta function (betainc).
    # We just need to massage the input a little
    alpha = count / 2 - 1
    x = (correl + 1) / 2                # since -1 <= correl <= 1, this makes 0 <= x <= 1
    x = np.where(correl < 0, x, 1 - x)  # mirror non-negative r so betainc evaluates cdf(-abs(r))
    p = 2 * scipy.special.betainc(alpha, alpha, x)
    return idx, correl, p
Next, compile utility.pyx into machine code:
python setup.py build_ext --inplace
Then you can use utility like any other Python module:
from utility import corr2

index, corr, p = corr2(df)
idx = pd.MultiIndex.from_tuples(
    [(df.columns[i], df.columns[j]) for i, j in index] +
    [(df.columns[j], df.columns[i]) for i, j in index]
)
full_index = pd.MultiIndex.from_product([df.columns, df.columns])
result = pd.DataFrame({
    'corr': np.tile(corr, 2),
    'p': np.tile(p, 2)
}, index=idx).reindex(full_index).unstack()
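Finally, as a consistency check (my suggestion, not part of the original answer), you can verify that the Cython version agrees with the pure-Python one:

import numpy as np

_, r1, p1 = corr1(df)
_, r2, p2 = corr2(df)
np.testing.assert_allclose(r1, r2)
np.testing.assert_allclose(p1, p2)
print("corr1 and corr2 agree")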