Python baseline correction library

I am currently working with some Raman spectroscopy data, and I am trying to correct for skewing in my data caused by fluorescence. Take a look at the graph below:

I am pretty close to achieving what I want. As you can see, I am fitting a polynomial to all of my data, whereas I should really just be fitting a polynomial to the local minima.

Ideally, I would like a polynomial fit which, when subtracted from my original data, would give a result like this:

Are there any built-in libraries that do this already?

If not, is there a simple algorithm anyone can recommend to me?

I found an answer to my question, just sharing it for everyone who stumbles upon this.

There is an algorithm called "Asymmetric Least Squares Smoothing" by P. Eilers and H. Boelens from 2005. The paper is free and you can find it on Google.

import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def baseline_als(y, lam, p, niter=10):
  L = len(y)
  # Second-order difference matrix, built from a dense identity
  # (see the later answers for a memory-friendly sparse construction)
  D = sparse.csc_matrix(np.diff(np.eye(L), 2))
  w = np.ones(L)
  for i in xrange(niter):  # xrange is Python 2; use range on Python 3
    W = sparse.spdiags(w, 0, L, L)
    Z = W + lam * D.dot(D.transpose())
    z = spsolve(Z, w*y)
    # asymmetric weights: p above the current baseline, 1-p below it
    w = p * (y > z) + (1-p) * (y < z)
  return z
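
A quick usage sketch (raman_signal and the parameter values here are placeholders, not part of the original answer; see the notes further down for how lam and p are usually tuned):

# Estimate the fluorescence background and subtract it from the raw signal
baseline = baseline_als(raman_signal, lam=1e5, p=0.01)
corrected = raman_signal - baseline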

I know this is an old question, but I stumbled upon it a few months ago and implemented an equivalent answer using scipy.sparse routines.

# Baseline removal

import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def baseline_als(y, lam, p, niter=10):

    s  = len(y)
    # assemble the second-order difference matrix from three diagonals
    D0 = sparse.eye( s )
    d1 = [np.ones( s-1 ) * -2]
    D1 = sparse.diags( d1, [-1] )
    d2 = [ np.ones( s-2 ) * 1]
    D2 = sparse.diags( d2, [-2] )

    D  = D0 + D2 + D1
    w  = np.ones( s )
    for i in range( niter ):
        W = sparse.diags( [w], [0] )
        Z = W + lam*D.dot( D.transpose() )
        z = spsolve( Z, w*y )
        w = p * (y > z) + (1-p) * (y < z)

    return z

Cheers,

Pedro

The following code works on Python 3.6.

This is adapted from the accepted answer to avoid the dense matrix diff computation (which can easily lead to memory issues) and uses range (not xrange).

import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def baseline_als(y, lam, p, niter=10):
  L = len(y)
  # sparse second-order difference matrix, no dense intermediate
  D = sparse.diags([1,-2,1],[0,-1,-2], shape=(L,L-2))
  w = np.ones(L)
  for i in range(niter):
    W = sparse.spdiags(w, 0, L, L)
    Z = W + lam * D.dot(D.transpose())
    z = spsolve(Z, w*y)
    w = p * (y > z) + (1-p) * (y < z)
  return z

I needed to use this method recently. The code from the answers works well, but it clearly overuses memory. So here is my version with optimized memory usage:

def baseline_als_optimized(y, lam, p, niter=10):
    L = len(y)
    D = sparse.diags([1,-2,1],[0,-1,-2], shape=(L,L-2))
    D = lam * D.dot(D.transpose()) # Precompute this term since it does not depend on `w`
    w = np.ones(L)
    W = sparse.spdiags(w, 0, L, L)
    for i in range(niter):
        W.setdiag(w) # Do not create a new matrix, just update diagonal values
        Z = W + D
        z = spsolve(Z, w*y)
        w = p * (y > z) + (1-p) * (y < z)
    return z

It is also about 1.5 times faster, according to my benchmarks below.

%%timeit -n 1000 -r 10 y = randn(1000)
baseline_als(y, 10000, 0.05) # function from @jpantina's answer
# 20.5 ms ± 382 µs per loop (mean ± std. dev. of 10 runs, 1000 loops each)

%%timeit -n 1000 -r 10 y = randn(1000)
baseline_als_optimized(y, 10000, 0.05)
# 13.3 ms ± 874 µs per loop (mean ± std. dev. of 10 runs, 1000 loops each)

Note 1: The original paper says:

To emphasize the basic simplicity of the algorithm, the number of iterations has been fixed to 10. In practical applications one should check whether the weights show any change; if not, convergence has been attained.

So this means that the more correct way to stop the iteration is to check whether ||w_new - w|| < tolerance.
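
As an illustration, a minimal sketch of that stopping rule applied to the optimized loop above (tol and max_iter are assumed parameters, not part of the original code):

def baseline_als_conv(y, lam, p, tol=1e-6, max_iter=50):
    L = len(y)
    D = sparse.diags([1,-2,1],[0,-1,-2], shape=(L,L-2))
    D = lam * D.dot(D.transpose())
    w = np.ones(L)
    W = sparse.spdiags(w, 0, L, L)
    for i in range(max_iter):
        W.setdiag(w)
        z = spsolve(W + D, w*y)
        w_new = p * (y > z) + (1-p) * (y < z)
        if np.linalg.norm(w_new - w) < tol:  # weights no longer change
            break
        w = w_new
    return z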

Note 2: Another useful quote (from @glycoaddict's comment) gives an idea of how to choose values for the parameters.

There are two parameters: p for asymmetry and λ for smoothness. Both have to be tuned to the data at hand. We found that generally 0.001 ≤ p ≤ 0.1 is a good choice (for a signal with positive peaks) and 10² ≤ λ ≤ 10⁹, but exceptions may occur. In any case one should vary λ on a grid that is approximately linear for log λ. Often visual inspection is sufficient to get good parameter values.
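
In code, such a grid can be generated with np.logspace; a small sketch (the bounds follow the quote above, and y stands for your signal):

# Candidate smoothness values, evenly spaced in log(lambda): 10^2 .. 10^9
for lam in np.logspace(2, 9, num=8):
    z = baseline_als(y, lam, p=0.01)
    # inspect y - z visually for each lambda to pick a good value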

There is a Python library available for baseline correction/removal. It has the Modpoly, IModPoly and ZhangFit algorithms, which can return the baseline-corrected results when you input the original values as a Python list or pandas Series and specify the polynomial degree.

Install the library as pip install BaselineRemoval. Below is an example:

from BaselineRemoval import BaselineRemoval

input_array=[10,20,1.5,5,2,9,99,25,47]
polynomial_degree=2 #only needed for Modpoly and IModPoly algorithm

baseObj=BaselineRemoval(input_array)
Modpoly_output=baseObj.ModPoly(polynomial_degree)
Imodpoly_output=baseObj.IModPoly(polynomial_degree)
Zhangfit_output=baseObj.ZhangFit()

print('Original input:',input_array)
print('Modpoly base corrected values:',Modpoly_output)
print('IModPoly base corrected values:',Imodpoly_output)
print('ZhangFit base corrected values:',Zhangfit_output)

Original input: [10, 20, 1.5, 5, 2, 9, 99, 25, 47]
Modpoly base corrected values: [-1.98455800e-04  1.61793368e+01  1.08455179e+00  5.21544654e+00
  7.20210508e-02  2.15427531e+00  8.44622093e+01 -4.17691125e-03
  8.75511661e+00]
IModPoly base corrected values: [-0.84912125 15.13786196 -0.11351367  3.89675187 -1.33134142  0.70220645
 82.99739548 -1.44577432  7.37269705]
ZhangFit base corrected values: [ 8.49924691e+00  1.84994576e+01 -3.31739230e-04  3.49854060e+00
  4.97412948e-01  7.49628529e+00  9.74951576e+01  2.34940300e+01
  4.54929023e+01]
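
The description above also mentions pandas Series input; a minimal sketch of that, assuming the constructor accepts a Series the same way it accepts a list:

import pandas as pd

series_input = pd.Series(input_array)
baseObj = BaselineRemoval(series_input)
Zhangfit_output = baseObj.ZhangFit()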

I used a version of the algorithm referenced by glinka in a previous comment, which is an improvement of the penalized weighted least squares method published in a relatively recent paper. I took Rustam Guliev's code to build this one:

from scipy import sparse
from scipy.sparse import linalg
import numpy as np
from numpy.linalg import norm


def baseline_arPLS(y, ratio=1e-6, lam=100, niter=10, full_output=False):
    L = len(y)

    diag = np.ones(L - 2)
    D = sparse.spdiags([diag, -2*diag, diag], [0, -1, -2], L, L - 2)

    H = lam * D.dot(D.T)  # The transposes are flipped w.r.t the Algorithm on pg. 252

    w = np.ones(L)
    W = sparse.spdiags(w, 0, L, L)

    crit = 1
    count = 0

    while crit > ratio:
        z = linalg.spsolve(W + H, W * y)
        d = y - z
        dn = d[d < 0]

        m = np.mean(dn)
        s = np.std(dn)

        w_new = 1 / (1 + np.exp(2 * (d - (2*s - m))/s))

        crit = norm(w_new - w) / norm(w)

        w = w_new
        W.setdiag(w)  # Do not create a new matrix, just update diagonal values

        count += 1

        if count > niter:
            print('Maximum number of iterations exceeded')
            break

    if full_output:
        info = {'num_iter': count, 'stop_criterion': crit}
        return z, d, info
    else:
        return z

To test the algorithm, I created a spectrum similar to the one shown in Fig. 3 of the paper, by first generating a simulated spectrum consisting of multiple Gaussian peaks:

def spectra_model(x):
    coeff = np.array([100, 200, 100])
    mean = np.array([300, 750, 800])

    stdv = np.array([15, 30, 15])

    terms = []
    for ind in range(len(coeff)):
        term = coeff[ind] * np.exp(-((x - mean[ind]) / stdv[ind])**2)
        terms.append(term)

    spectra = sum(terms)

    return spectra

x_vals = np.arange(1, 1001)
spectra_sim = spectra_model(x_vals)

Then, I created a third-order interpolating polynomial using 4 points taken directly from the paper:

from scipy.interpolate import CubicSpline
x_poly = np.array([0, 250, 700, 1000])
y_poly = np.array([200, 180, 230, 200])

poly = CubicSpline(x_poly, y_poly)
baseline = poly(x_vals)

noise = np.random.randn(len(x_vals)) * 0.1
spectra_base = spectra_sim + baseline + noise

Finally, I used the baseline correction algorithm to subtract the baseline from the altered spectrum (spectra_base):

_, spectra_arPLS, info = baseline_arPLS(spectra_base, lam=1e4, niter=10,
                                        full_output=True)

The result was (for reference, I compared it with a pure ALS implementation using Rustam Guliev's code, with lam = 1e4 and p = 0.001):
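
The referenced comparison figure is not reproduced here; for anyone who wants to recreate it, a minimal plotting sketch (matplotlib is an assumption, and baseline_als is the ALS implementation from the earlier answer):

import matplotlib.pyplot as plt

# pure ALS baseline for comparison, with the parameters quoted above
baseline_ALS = baseline_als(spectra_base, lam=1e4, p=0.001)

plt.plot(x_vals, spectra_base, label='simulated spectrum + baseline')
plt.plot(x_vals, spectra_base - baseline_ALS, label='ALS corrected')
plt.plot(x_vals, spectra_arPLS, label='arPLS corrected')
plt.legend()
plt.show()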