Fitting a single Gaussian to 'noisy' data yields a poor fit in some cases

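I have some noisy data that can contain between 0 and n Gaussian shapes. I am trying to implement an algorithm that takes the highest data points and fits a Gaussian to them according to the following 'scheme':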

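New attempt, steps:

  1. Fit a spline through all data points
  2. Get the first derivative of the spline function
  3. Get both data points (left/right) where f'(x) = 0 around the data point with maximum intensity
  4. Fit a Gaussian through the data points returned from 3

    4a. Plot the Gaussian (stopping at the baseline) in a PDF

  5. Calculate the area under the Gaussian curve
  6. Calculate the area under the raw data points
  7. Calculate the percentage of the total area accounted for by the Gaussian area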
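The last three steps (the two areas and the percentage) are not part of the code in this post, so here is a minimal sketch of how they could look. It rests on assumptions that are mine and not from the post: the Gaussian area uses the closed form A·|sigma|·sqrt(2π), which integrates over the whole real line instead of stopping at the window edges; the raw area is a trapezoid sum of the background-subtracted intensities; and the function names are hypothetical.

```python
import numpy as np

def gaussian(x, A, mu, sigma):
    """Gaussian without offset, same form as gaussFunction in the post."""
    return A * np.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2))

def gaussian_area(A, sigma):
    """Closed-form area under A*exp(-(x-mu)^2/(2 sigma^2)) over all x (assumption)."""
    return A * abs(sigma) * np.sqrt(2.0 * np.pi)

def trapezoid_area(x, y):
    """Area under the points (x, y) using the trapezoid rule."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def gaussian_percentage(x, y, background, A, sigma):
    """Percentage of the background-subtracted raw area covered by the Gaussian."""
    raw_area = trapezoid_area(x, np.asarray(y, dtype=float) - background)
    return 100.0 * gaussian_area(A, sigma) / raw_area

# Synthetic check: one clean Gaussian sitting on a flat background.
x = np.linspace(-5.0, 5.0, 2001)
y = gaussian(x, 2.0, 0.0, 0.5) + 10.0
print(gaussian_percentage(x, y, background=10.0, A=2.0, sigma=0.5))
```

With real data, A and sigma would come from the coeff array returned by curve_fit, and whether the background should be subtracted before integrating depends on how the baseline is defined.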

I have implemented this concept using the following code (minimal working example):

#! /usr/bin/env python
from scipy.interpolate import InterpolatedUnivariateSpline
from scipy.optimize import curve_fit
from scipy.signal import argrelextrema
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

data = [(9.60380153195,187214),(9.62028167623,181023),(9.63676350256,174588),(9.65324602212,169389),(9.66972824591,166921),(9.68621215187,167597),(9.70269675106,170838),(9.71918105436,175816),(9.73566703995,181552),(9.75215371878,186978),(9.76864010158,191718),(9.78512816681,194473),(9.80161692526,194169),(9.81810538757,191203),(9.83459553243,186603),(9.85108637051,180273),(9.86757691233,171996),(9.88406913682,163653),(9.90056205454,156032),(9.91705467586,149928),(9.93354897998,145410),(9.95004397733,141818),(9.96653867816,139042),(9.98303506191,137546),(9.99953213889,138724)]
data2 = [(9.60476933166,163571),(9.62125990879,156662),(9.63775225872,150535),(9.65424539203,146960),(9.67073831905,146794),(9.68723301904,149326),(9.70372850238,152616),(9.72022377931,155420),(9.73672082933,156151),(9.75321866271,154633),(9.76971628954,151549),(9.78621568961,148298),(9.80271587303,146333),(9.81921584976,146734),(9.83571759987,150351),(9.85222013334,156612),(9.86872245996,164192),(9.88522656011,171199),(9.90173144362,175697),(9.91823612015,176867),(9.93474257034,175029),(9.95124980389,171762),(9.96775683032,168449),(9.98426563055,165026)]

def gaussFunction(x, *p):
    """Gaussian with amplitude A, centre mu and width sigma, evaluated at x."""
    A, mu, sigma = p
    return A*np.exp(-(x-mu)**2/(2.*sigma**2))

def quantify(data):
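    """Fit a Gaussian to the most intense region of `data` and save a plot.

    `data` is a sequence of (time, intensity) tuples.
    """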
    backGround = 105000  # Normally this is dynamically determined but this value is fine for testing on the provided data
    time,intensity = zip(*data)
    x_data = np.array(time)
    y_data = np.array(intensity)
    newX = np.linspace(x_data[0], x_data[-1], int(2500*(x_data[-1]-x_data[0])))  # num must be an integer
    f = InterpolatedUnivariateSpline(x_data, y_data)
    fPrime = f.derivative()
    newY = f(newX)
    newPrimeY = fPrime(newX)
    maxm = argrelextrema(newPrimeY, np.greater)
    minm = argrelextrema(newPrimeY, np.less)
    breaks = maxm[0].tolist() + minm[0].tolist()
    maxPoint = 0
    for index,j in enumerate(breaks):
        try:
            if max(newY[breaks[index]:breaks[index+1]]) > maxPoint:
                maxPoint = max(newY[breaks[index]:breaks[index+1]])
                xData = newX[breaks[index]:breaks[index+1]]
                yData = [x - backGround for x in newY[breaks[index]:breaks[index+1]]]
        except:
            pass
    # Gaussian fit on main points
    newGaussX = np.linspace(x_data[0], x_data[-1], int(2500*(x_data[-1]-x_data[0])))  # num must be an integer
    p0 = [np.max(yData), xData[np.argmax(yData)],0.1]
    try:
        coeff, var_matrix = curve_fit(gaussFunction, xData, yData, p0)
        newGaussY = gaussFunction(newGaussX, *coeff)
        newGaussY = [x + backGround for x in newGaussY]


        # Generate plot for visual confirmation
        fig = plt.figure()

        ax = fig.add_subplot(111)
        plt.plot(x_data, y_data, 'b*')

        plt.plot((newX[0],newX[-1]),(backGround,backGround),'red')
        plt.plot(newX,newY, color='blue',linestyle='dashed')
        plt.plot(newGaussX, newGaussY, color='green',linestyle='dashed')
        plt.title("Test")
        plt.xlabel("rt [m]")
        plt.ylabel("intensity [au]")
        plt.savefig("Test.pdf",bbox_inches="tight")
        plt.close(fig)
    except:
        pass

# Call the test
#quantify(data)
quantify(data2)

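Normally the background (the red line in the images below) is determined dynamically, but for the sake of this example I have set it to a fixed number. The problem I am having is that for some data it works really well:

The corresponding f'(x):

However, for some other data it fails horribly:

The corresponding f'(x):

I would therefore like to hear some suggestions or ideas on why this happens and on possible ways to fix it. I have included the data shown in the plots below (in case anyone wants to try it):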

The error was in the following bit:

breaks = maxm[0].tolist() + minm[0].tolist()
for index,j in enumerate(breaks):

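The breaks list now contains both the maxima and the minima, but they are not sorted by time. For the poorly fitted data this results in the list yielding the following data points: 9.78, 9.62 and 9.86.

The program then examines the data from 9.78 to 9.62 and from 9.62 to 9.86. The first range is empty because it runs backwards, so the range from 9.62 to 9.86 ends up containing the highest intensity data point, which yields the fit shown in the second graph.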
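To make the effect concrete, here is a small sketch with made-up numbers (not the data from the question). The hypothetical helper `windows` slices an array between consecutive break indices, the same way the loop in the question does, and reports the maximum of each slice.

```python
import numpy as np

def windows(y, breaks):
    """List (start, stop, max) for each consecutive pair of break indices."""
    result = []
    for start, stop in zip(breaks, breaks[1:]):
        segment = y[start:stop]
        if segment.size == 0:
            # start >= stop: the slice is empty, so max() would raise here
            result.append((start, stop, None))
        else:
            result.append((start, stop, float(segment.max())))
    return result

y = np.array([1.0, 3.0, 2.0, 5.0, 9.0, 4.0, 1.0])  # made-up intensities
unsorted_breaks = [3, 0, 6]  # like maxima indices followed by minima indices

print(windows(y, unsorted_breaks))          # one empty window, one window spanning a break
print(windows(y, sorted(unsorted_breaks)))  # two proper, adjacent windows
```

In the question's code the empty slice makes max() raise, and the bare except silently swallows that, which is why the problem only showed up as a bad fit and not as an error.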

The fix is quite simple: sort the breaks before looping over them, as follows:

breaks = maxm[0].tolist() + minm[0].tolist()
breaks = sorted(breaks)
for index,j in enumerate(breaks):

With this change the program produced a fit that matched my expectations much better.
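As a side note, once the breaks are sorted the try/except around the loop is no longer needed, because every pair of neighbouring breaks gives a non-empty slice. A possible way to write the window selection is sketched below. The function name `highest_window` is hypothetical, the running maximum starts at -inf where the original code uses 0, and the synthetic two-peak signal with np.gradient only stands in for the spline and its derivative.

```python
import numpy as np
from scipy.signal import argrelextrema

def highest_window(new_y, new_prime_y):
    """Return (start, stop) indices of the window between neighbouring
    extrema of the derivative that contains the largest value of new_y,
    or None if there are fewer than two extrema."""
    maxima = argrelextrema(new_prime_y, np.greater)[0]
    minima = argrelextrema(new_prime_y, np.less)[0]
    breaks = np.sort(np.concatenate((maxima, minima)))  # sorted by position
    best = None
    best_peak = -np.inf
    for start, stop in zip(breaks[:-1], breaks[1:]):
        peak = new_y[start:stop].max()
        if peak > best_peak:
            best_peak = peak
            best = (int(start), int(stop))
    return best

# Synthetic signal: a small peak at x = 3 and a larger one at x = 7.
x = np.linspace(0.0, 10.0, 1001)
y = np.exp(-(x - 3.0) ** 2 / 0.5) + 2.0 * np.exp(-(x - 7.0) ** 2 / 0.5)
dy = np.gradient(y, x)

start, stop = highest_window(y, dy)
print(x[start], x[stop])  # bounds of the window around the larger peak
```

The slice new_y[start:stop] can then be passed to curve_fit in the same way as xData and yData in the original code.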