Python baseline correction library
I am currently working with some Raman spectra data, and I am trying to correct the skewing in my data caused by fluorescence. Take a look at the graph below:
I am pretty close to achieving what I want. As you can see, I am trying to fit a polynomial in all my data, whereas I should really just be fitting a polynomial at the local minima.
Ideally, I would want a polynomial fit which, when subtracted from my original data, would give a result like this:
Are there any built-in libraries that do this already?
If not, is there any simple algorithm anyone can recommend for me?
I found an answer to my question, just sharing it for everyone who stumbles upon this.
There is an algorithm called "Asymmetric Least Squares Smoothing" by P. Eilers and H. Boelens from 2005. The paper is free and you can find it on Google.
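For context (paraphrasing the paper in my own notation), the method finds a baseline z by iteratively minimizing an asymmetrically weighted penalized least squares objective

S = \sum_i w_i (y_i - z_i)^2 + \lambda \sum_i (\Delta^2 z_i)^2

where w_i = p if y_i > z_i and w_i = 1 - p otherwise, so points above the baseline (the peaks) are weighted far less than points below it.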
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def baseline_als(y, lam, p, niter=10):
    L = len(y)
    # second-order difference matrix, built densely via np.diff (see the
    # answers below for a sparse construction that avoids the memory cost)
    D = sparse.csc_matrix(np.diff(np.eye(L), 2))
    w = np.ones(L)
    for i in xrange(niter):  # Python 2; on Python 3 use range
        W = sparse.spdiags(w, 0, L, L)
        Z = W + lam * D.dot(D.transpose())
        z = spsolve(Z, w*y)
        w = p * (y > z) + (1-p) * (y < z)
    return z
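A minimal usage sketch, assuming y is your raw spectrum as a 1-D numpy array (the lam and p values here are only illustrative starting points, not recommendations from the paper):

baseline = baseline_als(y, 1e5, 0.01)  # smooth curve hugging the local minima
corrected = y - baseline               # spectrum with the fluorescence background removed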
I know this is an old question, but I stumbled upon it a few months ago and implemented an equivalent answer using scipy.sparse routines.
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

# Baseline removal
def baseline_als(y, lam, p, niter=10):
    s = len(y)
    # assemble difference matrix
    D0 = sparse.eye(s)
    d1 = [np.ones(s - 1) * -2]
    D1 = sparse.diags(d1, [-1])
    d2 = [np.ones(s - 2) * 1]
    D2 = sparse.diags(d2, [-2])
    D = D0 + D2 + D1
    w = np.ones(s)
    for i in range(niter):
        W = sparse.diags([w], [0])
        Z = W + lam * D.dot(D.transpose())
        z = spsolve(Z, w * y)
        w = p * (y > z) + (1 - p) * (y < z)
    return z
Cheers,
Pedro
The following code works on Python 3.6.
It is adapted from the accepted answer to avoid the dense-matrix diff computation (which can easily cause memory issues) and uses range (not xrange).
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def baseline_als(y, lam, p, niter=10):
    L = len(y)
    D = sparse.diags([1, -2, 1], [0, -1, -2], shape=(L, L - 2))
    w = np.ones(L)
    for i in range(niter):
        W = sparse.spdiags(w, 0, L, L)
        Z = W + lam * D.dot(D.transpose())
        z = spsolve(Z, w * y)
        w = p * (y > z) + (1 - p) * (y < z)
    return z
I needed this method recently. The code from the answers works well, but it clearly overuses memory, so here is my version with optimized memory usage:
def baseline_als_optimized(y, lam, p, niter=10):
    L = len(y)
    D = sparse.diags([1, -2, 1], [0, -1, -2], shape=(L, L - 2))
    D = lam * D.dot(D.transpose())  # Precompute this term since it does not depend on `w`
    w = np.ones(L)
    W = sparse.spdiags(w, 0, L, L)
    for i in range(niter):
        W.setdiag(w)  # Do not create a new matrix, just update diagonal values
        Z = W + D
        z = spsolve(Z, w * y)
        w = p * (y > z) + (1 - p) * (y < z)
    return z
According to my benchmarks below, it is also about 1.5 times faster.
%%timeit -n 1000 -r 10 y = randn(1000)
baseline_als(y, 10000, 0.05)  # function from @jpantina's answer
# 20.5 ms ± 382 µs per loop (mean ± std. dev. of 10 runs, 1000 loops each)

%%timeit -n 1000 -r 10 y = randn(1000)
baseline_als_optimized(y, 10000, 0.05)
# 13.3 ms ± 874 µs per loop (mean ± std. dev. of 10 runs, 1000 loops each)
Note 1: The original paper says:
To emphasize the basic simplicity of the algorithm, the number of iterations has been fixed to 10. In practical applications one should check whether the weights show any change; if not, convergence has been attained.
So, this means that the more correct way to stop the iterations is to check whether ||w_new - w|| < tolerance.
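A minimal sketch of that stopping rule, applied to the optimized version above (the function name, tol, and max_iter are my own additions, not part of the original code):

def baseline_als_conv(y, lam, p, tol=1e-6, max_iter=50):
    L = len(y)
    D = sparse.diags([1, -2, 1], [0, -1, -2], shape=(L, L - 2))
    D = lam * D.dot(D.transpose())
    w = np.ones(L)
    W = sparse.spdiags(w, 0, L, L)
    for i in range(max_iter):
        W.setdiag(w)
        Z = W + D
        z = spsolve(Z, w * y)
        w_new = p * (y > z) + (1 - p) * (y < z)
        if np.linalg.norm(w_new - w) < tol:  # weights stopped changing: converged
            break
        w = w_new
    return z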
Note 2: Another useful quote (from @glycoaddict's comment) gives an idea of how to choose the values of the parameters:
There are two parameters: p for asymmetry and λ for smoothness. Both have to be tuned to the data at hand. We found that generally 0.001 ≤ p ≤ 0.1 is a good choice (for a signal with positive peaks) and 10^2 ≤ λ ≤ 10^9, but exceptions may occur. In any case one should vary λ on a grid that is approximately linear for log λ. Often visual inspection is sufficient to get good parameter values.
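As one illustration of that advice (this snippet is mine, not from the paper; y is assumed to be your spectrum), a grid that is approximately linear in log λ can be scanned like this:

lambdas = np.logspace(2, 9, num=8)  # 1e2, 1e3, ..., 1e9: linear in log(lambda)
for lam in lambdas:
    z = baseline_als_optimized(y, lam, p=0.01)
    # inspect y - z for each lam and keep the value that flattens the
    # baseline without distorting the peaks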
There is a python library available for baseline correction/removal. It has the Modpoly, IModPoly and Zhang fit algorithms, and it returns the baseline-corrected results when you pass in the original values as a python list or pandas series and specify the polynomial degree.
Install the library with pip install BaselineRemoval. Below is an example:
from BaselineRemoval import BaselineRemoval
input_array=[10,20,1.5,5,2,9,99,25,47]
polynomial_degree=2 #only needed for Modpoly and IModPoly algorithm
baseObj=BaselineRemoval(input_array)
Modpoly_output=baseObj.ModPoly(polynomial_degree)
Imodpoly_output=baseObj.IModPoly(polynomial_degree)
Zhangfit_output=baseObj.ZhangFit()
print('Original input:',input_array)
print('Modpoly base corrected values:',Modpoly_output)
print('IModPoly base corrected values:',Imodpoly_output)
print('ZhangFit base corrected values:',Zhangfit_output)
Original input: [10, 20, 1.5, 5, 2, 9, 99, 25, 47]
Modpoly base corrected values: [-1.98455800e-04 1.61793368e+01 1.08455179e+00 5.21544654e+00
7.20210508e-02 2.15427531e+00 8.44622093e+01 -4.17691125e-03
8.75511661e+00]
IModPoly base corrected values: [-0.84912125 15.13786196 -0.11351367 3.89675187 -1.33134142 0.70220645
82.99739548 -1.44577432 7.37269705]
ZhangFit base corrected values: [ 8.49924691e+00 1.84994576e+01 -3.31739230e-04 3.49854060e+00
 4.97412948e-01 7.49628529e+00 9.74951576e+01 2.34940300e+01
 4.54929023e+01]
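A small follow-up sketch (my own, not from the library's documentation): since the library returns the corrected signal, the fitted baseline itself can be recovered by subtraction:

import numpy as np

zhang_baseline = np.asarray(input_array) - np.asarray(Zhangfit_output)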
I used a version of the algorithm referenced by glinka in a previous comment, which is an improvement of the penalized weighted linear squares method published in a relatively recent paper. I took Rustam Guliev's code to build this one:
from scipy import sparse
from scipy.sparse import linalg
import numpy as np
from numpy.linalg import norm

def baseline_arPLS(y, ratio=1e-6, lam=100, niter=10, full_output=False):
    L = len(y)
    diag = np.ones(L - 2)
    D = sparse.spdiags([diag, -2*diag, diag], [0, -1, -2], L, L - 2)
    H = lam * D.dot(D.T)  # The transposes are flipped w.r.t the Algorithm on pg. 252
    w = np.ones(L)
    W = sparse.spdiags(w, 0, L, L)
    crit = 1
    count = 0
    while crit > ratio:
        z = linalg.spsolve(W + H, W * y)
        d = y - z
        dn = d[d < 0]
        m = np.mean(dn)
        s = np.std(dn)
        w_new = 1 / (1 + np.exp(2 * (d - (2*s - m))/s))
        crit = norm(w_new - w) / norm(w)
        w = w_new
        W.setdiag(w)  # Do not create a new matrix, just update diagonal values
        count += 1
        if count > niter:
            print('Maximum number of iterations exceeded')
            break
    if full_output:
        info = {'num_iter': count, 'stop_criterion': crit}
        return z, d, info
    else:
        return z
To test the algorithm, I created a spectrum similar to the one shown in Fig. 3 of the paper, by first generating a simulated spectrum consisting of multiple Gaussian peaks:
def spectra_model(x):
    coeff = np.array([100, 200, 100])
    mean = np.array([300, 750, 800])
    stdv = np.array([15, 30, 15])
    terms = []
    for ind in range(len(coeff)):
        term = coeff[ind] * np.exp(-((x - mean[ind]) / stdv[ind])**2)
        terms.append(term)
    spectra = sum(terms)
    return spectra

x_vals = np.arange(1, 1001)
spectra_sim = spectra_model(x_vals)
Then, I created a third-order interpolating polynomial using 4 points taken directly from the paper:
from scipy.interpolate import CubicSpline
x_poly = np.array([0, 250, 700, 1000])
y_poly = np.array([200, 180, 230, 200])
poly = CubicSpline(x_poly, y_poly)
baseline = poly(x_vals)
noise = np.random.randn(len(x_vals)) * 0.1
spectra_base = spectra_sim + baseline + noise
Finally, I used the baseline correction algorithm to subtract the baseline from the altered spectrum (spectra_base):
_, spectra_arPLS, info = baseline_arPLS(spectra_base, lam=1e4, niter=10,
                                        full_output=True)
The results were (for reference, I compared it with the pure ALS implementation from Rustam Guliev, using lam = 1e4 and p = 0.001):
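For anyone who wants to reproduce the comparison, a minimal plotting sketch (matplotlib is assumed; baseline_als_optimized is Rustam Guliev's implementation from above):

import matplotlib.pyplot as plt

z_arpls = baseline_arPLS(spectra_base, lam=1e4, niter=10)
z_als = baseline_als_optimized(spectra_base, lam=1e4, p=0.001)

plt.plot(x_vals, spectra_base, label='simulated spectrum + baseline + noise')
plt.plot(x_vals, z_arpls, label='arPLS baseline')
plt.plot(x_vals, z_als, label='ALS baseline')
plt.legend()
plt.show()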