如何对包含 NaN 的大型多维数组中的每个像素应用线性回归?

How to apply linear regression to every pixel in a large multi-dimensional array containing NaNs?

我有一个独立变量值的一维数组 (x_array),它与具有多个时间步长的 3D numpy 空间数据数组 (y_array) 中的时间步长相匹配。我的实际数据要大得多:300 多个时间步长和高达 3000 * 3000 像素:

import numpy as np
from scipy.stats import linregress

# Independent variable: four time-steps of 1-dimensional data 
x_array = np.array([0.5, 0.2, 0.4, 0.4])

# Dependent variable: four time-steps of 3x3 spatial data
y_array = np.array([[[-0.2,   -0.2,   -0.3],
                     [-0.3,   -0.2,   -0.3],
                     [-0.3,   -0.4,   -0.4]],

                    [[-0.2,   -0.2,   -0.4],
                     [-0.3,   np.nan, -0.3],
                     [-0.3,   -0.3,   -0.4]],

                    [[np.nan, np.nan, -0.3],
                     [-0.2,   -0.3,   -0.7],
                     [-0.3,   -0.3,   -0.3]],

                    [[-0.1,   -0.3,   np.nan],
                     [-0.2,   -0.3,   np.nan],
                     [-0.1,   np.nan, np.nan]]])

我想计算每个像素的线性回归并获得 y_array 中每个 xy 像素的 R 平方、P 值、截距和斜率,每个时间步的值在 x_array 作为我的自变量。

我可以重塑以获取表单中的数据以将其输入到矢量化且快速的 np.polyfit 中:

# Reshape so rows = number of time-steps and columns = pixels:
y_array_reshaped = y_array.reshape(len(y_array), -1)

# Do a first-degree polyfit
np.polyfit(x_array, y_array_reshaped, 1)

然而,这会忽略包含任何 NaN 值的像素(np.polyfit 不支持 NaN 值),并且不会计算我需要的统计信息(R 平方,P -值、截距和斜率)。

answer here 使用 scipy.stats import linregress 计算我需要的统计数据,并建议通过屏蔽这些 NaN 值来避免 NaN 问题。然而,这个例子是针对两个一维数组的,我不知道如何对我的情况应用类似的屏蔽方法,在这种情况下,y_array_reshaped 中的每一列都有一组不同的 NaN 值。

我的问题: 我如何计算包含许多 NaN 值的大型多维数组 (300 x 3000 x 3000) 中每个像素的回归统计数据?快速、矢量化的方式?

期望的结果: y_array 中每个像素的回归统计值(例如 R 平方)的 3 x 3 数组,即使该像素包含 NaN 时间序列中某个点的值

我不确定这将如何扩展(也许你可以使用 dask), but here is a pretty straightforward way to do this with a pandas DataFrame using the apply 方法:

import pandas as pd
import numpy as np
from scipy.stats import linregress

# Independent variable: four time-steps of 1-dimensional data 
x_array = np.array([0.5, 0.2, 0.4, 0.4])

# Dependent variable: four time-steps of 3x3 spatial data
y_array = np.array([[[-0.2,   -0.2,   -0.3],
                     [-0.3,   -0.2,   -0.3],
                     [-0.3,   -0.4,   -0.4]],

                    [[-0.2,   -0.2,   -0.4],
                     [-0.3,   np.nan, -0.3],
                     [-0.3,   -0.3,   -0.4]],

                    [[np.nan, np.nan, -0.3],
                     [-0.2,   -0.3,   -0.7],
                     [-0.3,   -0.3,   -0.3]],

                    [[-0.1,   -0.3,   np.nan],
                     [-0.2,   -0.3,   np.nan],
                     [-0.1,   np.nan, np.nan]]])

def lin_regress(col):
    "Mask nulls and apply stats.linregress"
    col = col.loc[~pd.isnull(col)]
    return linregress(col.index.tolist(), col)

# Build the DataFrame (each index represents a pixel)
df = pd.DataFrame(y_array.reshape(len(y_array), -1), index=x_array.tolist())

# Apply a our custom linregress wrapper to each function, split the tuple into separate columns
final_df = df.apply(lin_regress).apply(pd.Series)

# Name the index and columns to make this easier to read
final_df.columns, final_df.index.name = 'slope, intercept, r_value, p_value, std_err'.split(', '), 'pixel_number'

print(final_df)

输出:

                 slope  intercept   r_value       p_value   std_err
pixel_number                                                       
0             0.071429  -0.192857  0.188982  8.789623e-01  0.371154
1            -0.071429  -0.207143 -0.188982  8.789623e-01  0.371154
2             0.357143  -0.464286  0.944911  2.122956e-01  0.123718
3             0.105263  -0.289474  0.229416  7.705843e-01  0.315789
4             1.000000  -0.700000  1.000000  9.003163e-11  0.000000
5            -0.285714  -0.328571 -0.188982  8.789623e-01  1.484615
6             0.105263  -0.289474  0.132453  8.675468e-01  0.557000
7            -0.285714  -0.228571 -0.755929  4.543711e-01  0.247436
8             0.071429  -0.392857  0.188982  8.789623e-01  0.371154

在 numpy 级别,您可以使用 np.vectorize.

首先为每组数据定义棘手的部分:

def compute(x,y):
        mask=~np.isnan(y)
        return linregress(x[mask],y[mask])

然后定义处理数据的函数:

comp = np.vectorize(compute,signature="(k),(k)->(),(),(),(),()")

然后申请,重组数据以遵循广播规则:

res = comp(x_array,rollaxis(y_array,0,3))

最后,

final=np.dstack(res) 

现在 final[i,j] 包含 linregress 为像素 (i,j) 返回的五个参数。

它大致相当于 pandas 方法的答案,但速度快 2.5 倍。
一个 300x(100x100 图像)的问题大约需要 5 秒,所以算上你的一个小时。我认为做得更好并不容易,因为时间基本上花在了 linregress 函数上。

上面评论中提到的这个博客 post 包含一个非常快速的向量化函数,用于 Python 中多维数据的互相关、协方差和回归。它产生我需要的所有回归输出,并且在几毫秒内完成,因为它完全依赖于 xarray.

中的简单向量化数组操作

https://hrishichandanpurkar.blogspot.com/2017/09/vectorized-functions-for-correlation.html

我做了一个小改动(#3 后的第一行)以确保该函数正确说明每个像素中不同数量的 NaN 值:

def lag_linregress_3D(x, y, lagx=0, lagy=0):
"""
Input: Two xr.Datarrays of any dimensions with the first dim being time. 
Thus the input data could be a 1D time series, or for example, have three 
dimensions (time,lat,lon). 
Datasets can be provided in any order, but note that the regression slope 
and intercept will be calculated for y with respect to x.
Output: Covariance, correlation, regression slope and intercept, p-value, 
and standard error on regression between the two datasets along their 
aligned time dimension.  
Lag values can be assigned to either of the data, with lagx shifting x, and
lagy shifting y, with the specified lag amount. 
""" 
#1. Ensure that the data are properly alinged to each other. 
x,y = xr.align(x,y)

#2. Add lag information if any, and shift the data accordingly
if lagx!=0:

    # If x lags y by 1, x must be shifted 1 step backwards. 
    # But as the 'zero-th' value is nonexistant, xr assigns it as invalid 
    # (nan). Hence it needs to be dropped
    x   = x.shift(time = -lagx).dropna(dim='time')

    # Next important step is to re-align the two datasets so that y adjusts
    # to the changed coordinates of x
    x,y = xr.align(x,y)

if lagy!=0:
    y   = y.shift(time = -lagy).dropna(dim='time')
    x,y = xr.align(x,y)

#3. Compute data length, mean and standard deviation along time axis: 
n = y.notnull().sum(dim='time')
xmean = x.mean(axis=0)
ymean = y.mean(axis=0)
xstd  = x.std(axis=0)
ystd  = y.std(axis=0)

#4. Compute covariance along time axis
cov   =  np.sum((x - xmean)*(y - ymean), axis=0)/(n)

#5. Compute correlation along time axis
cor   = cov/(xstd*ystd)

#6. Compute regression slope and intercept:
slope     = cov/(xstd**2)
intercept = ymean - xmean*slope  

#7. Compute P-value and standard error
#Compute t-statistics
tstats = cor*np.sqrt(n-2)/np.sqrt(1-cor**2)
stderr = slope/tstats

from scipy.stats import t
pval   = t.sf(tstats, n-2)*2
pval   = xr.DataArray(pval, dims=cor.dims, coords=cor.coords)

return cov,cor,slope,intercept,pval,stderr

此处提供的答案 https://hrishichandanpurkar.blogspot.com/2017/09/vectorized-functions-for-correlation.html 绝对不错,因为它主要利用 numpy 矢量化和广播的强大功能,但它假设要分析的数据是完整的,这通常不是真实研究周期中的案例。上面的一个答案旨在解决丢失数据的问题,但我个人认为需要更新更多代码,因为如果数据中有 nan,np.mean() 将 return nan。幸运的是,numpy 提供了 nanmean()nanstd() 等供我们忽略数据中的 nans 来计算均值、标准误差等。同时,原博客中的程序以 netCDF 格式的数据为目标。有些人可能不知道这一点,但更熟悉原始 numpy.array 格式。因此,我在这里提供一个代码示例,展示如何计算两个3维数组之间的co-variance、相关系数等(n-D维是相同的逻辑)。请注意,为了方便,我将x_array作为y_array的第一个维度的索引,但x_array在实际分析中肯定可以从外部读取。

代码

def linregress_3D(y_array):
    # y_array is a 3-D array formatted like (time,lon,lat)
    # The purpose of this function is to do linear regression using time series of data over each (lon,lat) grid box with consideration of ignoring np.nan
    # Construct x_array indicating time indexes of y_array, namely the independent variable.
    x_array=np.empty(y_array.shape)
    for i in range(y_array.shape[0]): x_array[i,:,:]=i+1 # This would be fine if time series is not too long. Or we can use i+yr (e.g. 2019).
    x_array[np.isnan(y_array)]=np.nan
    # Compute the number of non-nan over each (lon,lat) grid box.
    n=np.sum(~np.isnan(x_array),axis=0)
    # Compute mean and standard deviation of time series of x_array and y_array over each (lon,lat) grid box.
    x_mean=np.nanmean(x_array,axis=0)
    y_mean=np.nanmean(y_array,axis=0)
    x_std=np.nanstd(x_array,axis=0)
    y_std=np.nanstd(y_array,axis=0)
    # Compute co-variance between time series of x_array and y_array over each (lon,lat) grid box.
    cov=np.nansum((x_array-x_mean)*(y_array-y_mean),axis=0)/n
    # Compute correlation coefficients between time series of x_array and y_array over each (lon,lat) grid box.
    cor=cov/(x_std*y_std)
    # Compute slope between time series of x_array and y_array over each (lon,lat) grid box.
    slope=cov/(x_std**2)
    # Compute intercept between time series of x_array and y_array over each (lon,lat) grid box.
    intercept=y_mean-x_mean*slope
    # Compute tstats, stderr, and p_val between time series of x_array and y_array over each (lon,lat) grid box.
    tstats=cor*np.sqrt(n-2)/np.sqrt(1-cor**2)
    stderr=slope/tstats
    from scipy.stats import t
    p_val=t.sf(tstats,n-2)*2
    # Compute r_square and rmse between time series of x_array and y_array over each (lon,lat) grid box.
    # r_square also equals to cor**2 in 1-variable lineare regression analysis, which can be used for checking.
    r_square=np.nansum((slope*x_array+intercept-y_mean)**2,axis=0)/np.nansum((y_array-y_mean)**2,axis=0)
    rmse=np.sqrt(np.nansum((y_array-slope*x_array-intercept)**2,axis=0)/n)
    # Do further filteration if needed (e.g. We stipulate at least 3 data records are needed to do regression analysis) and return values
    n=n*1.0 # convert n from integer to float to enable later use of np.nan
    n[n<3]=np.nan
    slope[np.isnan(n)]=np.nan
    intercept[np.isnan(n)]=np.nan
    p_val[np.isnan(n)]=np.nan
    r_square[np.isnan(n)]=np.nan
    rmse[np.isnan(n)]=np.nan
    return n,slope,intercept,p_val,r_square,rmse

示例输出

我用这个程序测试了两个 227x3601x6301 像素的 3-D 阵列,它在 20 分钟内完成了工作,每个都不到 10 分钟。