Issues with scipy find_peaks function when used on an inverted dataset
The script below mixes together Stack Overflow answers on different topics, but it is closely related to finding peaks in a signal. As mentioned here, finding peaks by prominence works very well, but my problem is that I need the lowest point immediately after each peak. The dataset is a plant fluorescence signal captured over a continuous 14-hour period, and the peaks are saturating pulses used to determine saturation under light conditions. Below is a picture of the dataset (a 68 MB CSV file):
Here is my Python script:
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
from scipy.signal import find_peaks
# A parser is required to translate the timestamp
custom_date_parser = lambda x: datetime.strptime(x, "%d-%m-%Y %H:%M_%S.%f")
df = pd.read_csv('15-01-2022_05_00.csv', parse_dates=[ 'Timestamp'], date_parser=custom_date_parser)
x = df['Timestamp']
y = df['Mean_values']
# As per accepted answer here:
#
peaks, _ = find_peaks(y, prominence=1)
# Invert the data to find the lowest points of peaks as per answer here:
#
valleys, _ = find_peaks(-y, prominence=1)
print(y[peaks])
print(y[valleys])
plt.subplot(2, 1, 1)
plt.plot(peaks, y[peaks], "ob")
plt.plot(y)
plt.legend(['Prominence'])
plt.subplot(2, 1, 2)
plt.plot(valleys, y[valleys], "vg")
plt.plot(y)
plt.legend(['Prominence Inverted'])
plt.show()
As shown in the figure, not all of the 'prominence inverted' points lie below their respective peaks. The prominence-inverted approach comes from this post and simply inverts the dataset. Some of the valley points sit adjacent to the previous peak (this is hard to see in the picture). The peaks and valleys are listed below:
Peaks
1817 109.587178
3674 89.191393
56783 72.779385
111593 77.868118
166403 83.288949
221213 84.955026
276023 84.340550
330833 83.186605
385643 81.134827
440453 79.060960
495264 77.457803
550074 76.292243
604884 75.867575
659694 75.511924
714504 74.221657
769314 73.830941
824125 76.977637
878935 78.826169
933745 77.819844
988555 77.298089
1043365 77.188105
1098175 75.340765
1152985 74.311185
1207796 73.163844
1262606 72.613317
1317416 73.460068
1372226 70.388324
1427036 70.835355
1481845 70.154085
Valleys
2521 4.669368
56629 26.551585
56998 26.184984
111791 26.288734
166620 27.717165
221434 28.312708
330432 28.235397
385617 27.535091
440341 26.886589
495174 26.379043
549353 26.040947
550239 25.760023
605051 25.594147
714352 25.354300
714653 25.008184
769472 24.883584
824284 25.135316
879075 25.477464
933907 25.265173
988711 25.160046
1097917 25.058851
1098333 24.626667
1153134 24.357835
1207943 23.982878
1262750 23.938298
1371013 23.766077
1372381 23.351263
1427187 23.368314
Any ideas about this awkward result for the valleys?
Trying to find all of the valleys is what complicates your task. It is always difficult because, compared with the surrounding data, they are not as prominent as your peaks. No matter which parameters you pass to find_peaks, it will sometimes identify two valleys after a peak and sometimes none. Instead, simply determine the local minimum after each peak:
import pandas as pd
import matplotlib.pyplot as plt
from scipy.signal import find_peaks
#sample data
from scipy.misc import electrocardiogram
x = electrocardiogram()[2000:4000]
date_range = pd.date_range("20210116", periods=x.size, freq="10ms")
df = pd.DataFrame({"Timestamp": date_range, "Mean_values": x})
x = df['Timestamp']
y = df['Mean_values']
fig, (ax1, ax2, ax3) = plt.subplots(3, figsize=(12, 8))
#peak finding
peaks, _ = find_peaks(y, prominence=1)
ax1.plot(x[peaks], y[peaks], "ob")
ax1.plot(x, y)
ax1.legend(['Prominence'])
#valley finder general
valleys, _ = find_peaks(-y, prominence=1)
ax2.plot(x[valleys], y[valleys], "vg")
ax2.plot(x, y)
ax2.legend(['Valleys without filtering'])
#valley finding restricted to a short time period after a peak
#set time window, e.g., for 200 ms
time_window_size = pd.Timedelta(200, unit="ms")
time_of_peaks = x[peaks]
peak_end = x.searchsorted(time_of_peaks + time_window_size)
#in case of evenly spaced data points, this can be simplified
#and you just add n data points to your peak index array
#peak_end = peaks + n
true_valleys = peaks.copy()
for i, (start, stop) in enumerate(zip(peaks, peak_end)):
true_valleys[i] = start + y[start:stop].argmin()
ax3.plot(x[true_valleys], y[true_valleys], "sr")
ax3.plot(x, y)
ax3.legend(['Valleys after events'])
plt.show()
Sample output:
I am not sure what you plan to do with these minima, but if you are only interested in the baseline shift, you can calculate a baseline value per peak directly, e.g.:
baseline_per_peak = peaks.copy().astype(float)
for i, (start, stop) in enumerate(zip(peaks, peak_end)):
baseline_per_peak[i] = y[start:stop].mean()
print(baseline_per_peak)
Sample output:
[-0.71125 -0.203 0.29225 0.72825 0.6835 0.79125 0.51225 0.23
0.0345 -0.3945 -0.48125 -0.4675 ]
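As the comment in the snippet above hints, if the data points are evenly spaced you can skip the searchsorted step and work with a fixed number of samples after each peak. A minimal sketch of that variant, assuming a hypothetical window length n that you would derive from your sampling interval:
import numpy as np
#hypothetical window: e.g. 20 samples, which is about 200 ms at 10 ms spacing
n = 20
#clip windows that would run past the end of the data
peak_end = np.minimum(peaks + n, len(y))
true_valleys = peaks.copy()
for i, (start, stop) in enumerate(zip(peaks, peak_end)):
    true_valleys[i] = start + y[start:stop].argmin()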
This can, of course, also easily be adapted to the period before a peak:
#valley in the short time period before a peak
#set time window, e.g., for 200 ms
time_window_size = pd.Timedelta(200, unit="ms")
time_of_peaks = x[peaks]
peak_start = x.searchsorted(time_of_peaks - time_window_size)
#in case of evenly spaced data points, this can be simplified
#and you just add n data points to your peak index array
#peak_start = peaks - n
true_valleys = peaks.copy()
for i, (start, stop) in enumerate(zip(peak_start, peaks)):
true_valleys[i] = start + y[start:stop].argmin()
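If you want a quick visual check of these pre-peak valleys, here is a small sketch that reuses x, y, peaks and the true_valleys array from the block above (the figure itself is just illustrative):
#plot the signal, the detected peaks, and the valleys found before each peak
fig2, ax = plt.subplots(figsize=(12, 3))
ax.plot(x, y)
ax.plot(x[peaks], y[peaks], "ob")
ax.plot(x[true_valleys], y[true_valleys], "^m")
ax.legend(['signal', 'peaks', 'valleys before events'])
plt.show()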