How to calculate a correlation with p-values most performant in Python?
I want to compute the correlation of my data together with its p-values. Currently I use Pandas and its corr method on my DataFrame. The problem is that this correlation method doesn't provide p-values.
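For reference, what I'm doing today is roughly this (a minimal sketch; the real data loading is omitted):

import pandas as pd

# df is my 500 x 550 DataFrame, containing None values
corr = df.corr()  # pairwise Pearson correlations, but no p-values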
So I tried the two answers to this question: Stack Overflow Question. Both solutions use the scipy.stats.pearsonr method for the calculation. I can't use the first one (Solution 1) because it removes too much of my dataset. My next try was the second one (Solution 2). It gives me the results I want, but it takes a huge amount of time.
For comparison: my pandas-only correlation takes about 4 seconds from creating the DataFrame to computing the correlation. Solution 2 needs about 6 minutes to return a result. My guess is that the repeated creation of new DataFrames requires a lot of computation, and that time adds up over my dataset.
Is there a more performant way to calculate this result? Pandas corr must also be doing this in the background to handle my None values, so there has to be a better solution.
My test dataset has 500 rows with 550 values each. As I said, there are also None values.
Solving your problem requires both math and programming. Since df.corr returns quickly for you, I will focus on the p-value:
Programming
scipy.stats.pearsonr(col_x, col_y) doesn't like dealing with NaNs. So for each pair of columns, you have to drop all rows where either or both elements are NaN. With 550 columns, there are 550 * 549 / 2 = 150,975 pairs, so you'd better make sure your loop is very efficient.
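For reference, here is roughly what such a per-pair loop looks like (my sketch, not the code from the linked question). The Python-level loop and the per-pair dropna are what make this approach slow:

import scipy.stats

def corr_naive(df):
    # For every pair of columns, drop rows with a NaN in either column,
    # then call pearsonr. This pays Python and pandas overhead on every
    # one of the ~150,975 pairs.
    results = {}
    cols = list(df.columns)
    for i, a in enumerate(cols):
        for b in cols[:i]:
            pair = df[[a, b]].dropna()
            r, p = scipy.stats.pearsonr(pair[a], pair[b])
            results[(a, b)] = (r, p)
    return results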
If you look at its source code, you'll see that DataFrame.corr is so fast for two reasons:
- It's coded in Cython and runs outside the Global Interpreter Lock (GIL). That means the loop runs in bare-metal C and is therefore very fast.
- It implements its own variance algorithm (Welford's method; see the sketch after this list) and doesn't rely on scipy.stats. The complexity of this algorithm is O(n * m^2), where n is the number of rows and m is the number of columns.
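To make the second point concrete, here is a minimal pure-Python sketch of Welford's method for the variance of a single column (the function name welford_var is mine; pandas implements the pairwise analogue in Cython):

import math

def welford_var(values):
    # Welford's method: a single, numerically stable pass over the data
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        if x is None or math.isnan(x):
            continue  # skip missing values, as pandas does
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return m2 / (n - 1) if n > 1 else float('nan')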
Math
The pearsonr documentation gives instructions on how the p-value is calculated:
r = <Pearson correlation coefficient>
dist = scipy.stats.beta(n/2 - 1, n/2 - 1, loc=-1, scale=2)
p = 2 * dist.cdf(-abs(r))
Wikipedia gives us the formula for the CDF of the Beta distribution:
cdf(x, alpha, beta) = B(x, alpha, beta) / B(alpha, beta)
= scipy.special.betainc(alpha, beta, x)
Luckily for us, the betainc function is vectorized, so if we pass in 3 arrays of the same length as arguments, it will return an array as output.
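As a quick sanity check (my addition, not part of the original answer), you can confirm on a single pair of columns that the betainc formulation reproduces the p-value from scipy.stats.pearsonr:

import numpy as np
import scipy.special
import scipy.stats

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = x + rng.normal(size=100)

r, p_ref = scipy.stats.pearsonr(x, y)

n = len(x)
alpha = n / 2 - 1
t = (r + 1) / 2            # map r from [-1, 1] to [0, 1]
t = t if r < 0 else 1 - t  # evaluate the CDF at -abs(r)
p = 2 * scipy.special.betainc(alpha, alpha, t)
print(np.isclose(p, p_ref))  # True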
Solution 1
This solution is in pure Python and delivers reasonable performance on your dataset (500 * 550). It took about 30 seconds on a 2014 iMac with 16GB of RAM:
import numpy as np
import pandas as pd
import scipy.special

def corr1(df):
    mask = df.notna().to_numpy()
    corr = df.corr().to_numpy()
    n_rows, n_cols = df.shape

    # Preallocate the return arrays for better performance
    length = int(n_cols * (n_cols - 1) / 2)
    idx = np.empty((length, 2), dtype=object)
    correl = np.empty(length, dtype=np.float64)
    count = np.empty(length, dtype=np.uint64)

    # For each 2-combination of columns, let `n` be the number of rows
    # where both elements are non-NaN. We will need that later to
    # calculate the p-value
    k = -1
    for i in range(n_cols):
        for j in range(i):
            n = 0
            for row in range(n_rows):
                n += 1 if mask[row, i] and mask[row, j] else 0
            k += 1
            idx[k] = (i, j)
            correl[k] = corr[i, j]
            count[k] = n

    # The p-value can be obtained with the incomplete Beta function (betainc).
    # We just need to massage the input a little
    alpha = count / 2 - 1
    x = (correl + 1) / 2
    x = np.where(correl < 0, x, 1 - x)
    p = 2 * scipy.special.betainc(alpha, alpha, x)
    return idx, correl, p

# Manipulate the return values into the right format
index, corr, p = corr1(df)
idx = pd.MultiIndex.from_tuples(
    [(df.columns[i], df.columns[j]) for i, j in index] +
    [(df.columns[j], df.columns[i]) for i, j in index]
)
full_index = pd.MultiIndex.from_product([df.columns, df.columns])
result = pd.DataFrame({
    'corr': np.tile(corr, 2),
    'p': np.tile(p, 2)
}, index=idx).reindex(full_index).unstack()
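To try it out end to end, here is a small example with made-up data and column names; note it assumes at least a couple of non-NaN rows per column pair:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
data = rng.normal(size=(500, 5))
data[rng.random(data.shape) < 0.1] = np.nan  # sprinkle in missing values
df = pd.DataFrame(data, columns=list('abcde'))

index, corr, p = corr1(df)
print(index[0], corr[0], p[0])  # first column pair, its r and its p-value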
Solution 2
For the absolute fastest solution, you have to write it in Cython. That reduces the execution time from 30 seconds to 5 seconds. I'm sure further optimizations are possible (but I was too lazy to explore them). The trade-off is a more complicated build and deployment process.
First, make sure you have a C compiler. Then install the Cython package:
pip install cython
Next, create 2 files: setup.py and utility.pyx:
#################################################
# setup.py
#################################################
from distutils.core import setup
from Cython.Build import cythonize
import numpy

compiler_directives = {
    'language_level': '3'
}
setup(
    ext_modules=cythonize("utility.pyx", compiler_directives=compiler_directives),
    include_dirs=[numpy.get_include()]
)
#################################################
# utility.pyx
#################################################
import cython
from cython import Py_ssize_t

import numpy as np
from numpy cimport (
    ndarray,
    float64_t,
    uint8_t,
    uint64_t,
)
import scipy.special

@cython.boundscheck(False)
@cython.wraparound(False)
def corr2(object df):
    # These variables go into the `nogil` context (i.e. into C land) so they
    # must be statically typed
    cdef:
        Py_ssize_t n_rows, n_cols, i, j, row, n, k
        ndarray[uint8_t, ndim=2] mask
        ndarray[float64_t, ndim=2] corr
        # the return arrays, filled inside the nogil block
        ndarray[uint64_t, ndim=2] idx
        ndarray[float64_t, ndim=1] correl
        ndarray[uint64_t, ndim=1] count

    # We are still in Python land and thus have full access to all functions
    # in numpy and pandas. Converting the pandas DataFrame to a 2D numpy
    # array gives a huge speed boost
    mask = df.notna().to_numpy(dtype='uint8')
    corr = df.corr().to_numpy()
    n_rows, n_cols = df.shape

    # Preallocate the return arrays
    length = int(n_cols * (n_cols - 1) / 2)
    idx = np.empty((length, 2), dtype=np.uint64)
    correl = np.empty(length, dtype=np.float64)
    count = np.empty(length, dtype=np.uint64)

    with nogil:
        # It's C-land in here. Everything must be statically defined
        k = -1
        # For each 2-combination of columns, let `n` be the number of rows
        # where both elements are non-NaN. We will need that later to
        # calculate the p-value
        for i in range(n_cols):
            for j in range(i):
                n = 0
                for row in range(n_rows):
                    n += 1 if mask[row, i] and mask[row, j] else 0
                k += 1
                idx[k, 0] = i
                idx[k, 1] = j
                correl[k] = corr[i, j]
                count[k] = n

    # Back to Python-land.
    # The p-value can be obtained with the incomplete Beta function (betainc).
    # We just need to massage the input a little
    alpha = count / 2 - 1
    x = (correl + 1) / 2                # since -1 <= correl <= 1, this makes 0 <= x <= 1
    x = np.where(correl < 0, x, 1 - x)  # mirror non-negative r so betainc evaluates cdf(-abs(r))
    p = 2 * scipy.special.betainc(alpha, alpha, x)
    return idx, correl, p
Next, compile utility.pyx into machine code:
python setup.py build_ext --inplace
Then you can use utility like any other Python module:
from utility import corr2

index, corr, p = corr2(df)
idx = pd.MultiIndex.from_tuples(
    [(df.columns[i], df.columns[j]) for i, j in index] +
    [(df.columns[j], df.columns[i]) for i, j in index]
)
full_index = pd.MultiIndex.from_product([df.columns, df.columns])
result = pd.DataFrame({
    'corr': np.tile(corr, 2),
    'p': np.tile(p, 2)
}, index=idx).reindex(full_index).unstack()
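Finally, as a consistency check (my suggestion, not part of the original answer), you can verify that the Cython version agrees with the pure-Python one:

import numpy as np

_, r1, p1 = corr1(df)
_, r2, p2 = corr2(df)
np.testing.assert_allclose(r1, r2)
np.testing.assert_allclose(p1, p2)
print("corr1 and corr2 agree")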