How do I discretize a continuous function avoiding noise generation (see picture)

Question

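I have a continuous input function that I want to discretize into 5-10 discrete bins between 0 and 1. Right now I am using np.digitize and rescaling the output bins to 0-1. The problem is that sometimes a dataset (blue line) yields results like this:

I tried increasing the number of discretization bins, but I ended up with the same noise and just more increments. As an example where the algorithm worked, using the same settings but another dataset:

This is the code I used there, where NumOfDisc is the number of bins: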

intervals = np.linspace(0,1,NumOfDisc)
discretized_Array = np.digitize(Continuous_Array, intervals)

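The red line in the picture is not important. The continuous blue line is what I am trying to discretize, and the green line is the discretized result. The plots were created with matplotlib.pyplot using the following code: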

import logging
import matplotlib.pyplot as plt

def CheckPlots(discretized_Array, Continuous_Array, Temperature, time, PlotName):
    logging.info("Plotting...")

    # Setting axis properties and titles
    fig, ax = plt.subplots(1, 1)
    ax.set_title(PlotName)
    ax.set_ylabel('Temperature [°C]')
    ax.set_ylim(40, 110)
    ax.set_xlabel('Time [s]')
    # Pass the flag positionally: the keyword was renamed from `b` to `visible` in newer matplotlib
    ax.grid(True, which="both")
    ax2 = ax.twinx()
    ax2.set_ylabel('DC Power [%]')
    ax2.set_ylim(-1.5, 3.5)

    # Plotting stuff
    ax.plot(time, Temperature, label="Input Temperature", color='#c70e04')
    ax2.plot(time, Continuous_Array, label="Continuous Power", color='#040ec7')
    ax2.plot(time, discretized_Array, label="Discrete Power", color='#539600')

    fig.legend(loc="upper left", bbox_to_anchor=(0, 1), bbox_transform=ax.transAxes)

    logging.info("Done!")
    logging.info("---")
    return

Any ideas on how to get a reasonable discretization like in the second case?

Answer 1

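If what I described in the comments is indeed the problem, there are a few options to deal with it:

Do nothing: depending on why you are discretizing, you may want the discrete values to reflect the continuous values accurately.
Change the bins: you could shift the bins or change the number of bins so that the relatively 'flat' parts of the blue line stay within one bin. Those parts then also give a flat green line, which is visually more pleasing, like in your second plot.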
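
To illustrate the "change the bins" option, here is a minimal sketch. It assumes the noise comes from a nearly flat signal that happens to sit right on a bin edge; the mock signal, the bin count, and the half-bin shift are all made up for the demonstration and are not taken from the question's data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical signal: a "flat" stretch at 0.5 with a little measurement noise
signal = 0.5 + rng.uniform(-0.01, 0.01, size=200)

# Edges at 0, 0.25, 0.5, 0.75, 1: the flat stretch sits exactly on the 0.5 edge,
# so tiny fluctuations keep flipping the output between two neighbouring bins
edges = np.linspace(0, 1, 5)
noisy_bins = np.digitize(signal, edges)

# Shift every edge by half a bin width so the flat stretch falls inside one bin
shifted_edges = edges + 0.125
stable_bins = np.digitize(signal, shifted_edges)

n_noisy = np.unique(noisy_bins).size    # number of distinct bins hit before the shift
n_stable = np.unique(stable_bins).size  # number of distinct bins hit after the shift
print(n_noisy, n_stable)
```

With the original edges the output jumps between two bins, while the shifted edges keep the whole stretch in a single bin. Whether a fixed shift helps on real data depends on where the flat parts of the signal actually lie.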

Answer 2

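The following solution gives the exact result you need.

Basically, the algorithm finds an ideal line and tries to replicate it as well as it can with fewer datapoints. It starts with 2 points at the edges (a straight line), then adds one in the center, then checks which side has the greatest error and adds a point in the center of that side, and so on until it reaches the desired bin count. Simple :)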

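import warnings

import numpy as np
import matplotlib.pyplot as plt

# RankWarning lives in numpy.exceptions in NumPy 2.0 and later; fall back to np.RankWarning otherwise
try:
    from numpy.exceptions import RankWarning
except ImportError:
    RankWarning = np.RankWarning

# The degree-50 polyfit below is poorly conditioned, so silence the resulting warning
warnings.simplefilter('ignore', RankWarning)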


def line_error(x0, y0, x1, y1, ideal_line, integral_points=100):
    """Assume a straight line between (x0, y0) -> (x1, y1). Then sample the ideal line multiple times and compute the distance."""
    straight_line = np.poly1d(np.polyfit([x0, x1], [y0, y1], 1))
    xs = np.linspace(x0, x1, num=integral_points)
    ys = straight_line(xs)

    perfect_ys = ideal_line(xs)
    
    err = np.abs(ys - perfect_ys).sum() / integral_points * (x1 - x0)  # Remove (x1 - x0) to only look at avg errors
    return err


def discretize_bisect(xs, ys, bin_count):
    """Returns xs and ys of discrete points"""
    # For a large number of datapoints, without loss of generality you can treat xs and ys as bin edges
    # If that gives bad results, you can compute the edges in other ways, e.g. with np.histogram_bin_edges
    ideal_line = np.poly1d(np.polyfit(xs, ys, 50))
    
    new_xs = [xs[0], xs[-1]]
    new_ys = [ys[0], ys[-1]]
    
    while len(new_xs) < bin_count:
        
        errors = []
        for i in range(len(new_xs)-1):
            err = line_error(new_xs[i], new_ys[i], new_xs[i+1], new_ys[i+1], ideal_line)
            errors.append(err)

        max_segment_id = np.argmax(errors)
        new_x = (new_xs[max_segment_id] + new_xs[max_segment_id+1]) / 2
        new_y = ideal_line(new_x)
        new_xs.insert(max_segment_id+1, new_x)
        new_ys.insert(max_segment_id+1, new_y)

    return new_xs, new_ys


BIN_COUNT = 25

new_xs, new_ys = discretize_bisect(xs, ys, BIN_COUNT)

plot_graph(xs, ys, new_xs, new_ys, f"Discretized and Continuous comparison, N(cont) = {N_MOCK}, N(disc) = {BIN_COUNT}")
print("Bin count:", len(new_xs))

Also, here is the simplified plotting function I tested with.

def plot_graph(cont_time, cont_array, disc_time, disc_array, plot_name):
    """A simplified version of the provided plotting function"""
    
    # Setting Axis properties and titles
    fig, ax = plt.subplots(figsize=(20, 4))
    ax.set_title(plot_name)
    ax.set_xlabel('Time [s]')
    ax.set_ylabel('DC Power [%]')

    # Plotting stuff
    ax.plot(cont_time, cont_array, label="Continuous Power", color='#0000ff')
    ax.plot(disc_time, disc_array, label="Discrete Power",   color='#00ff00')

    fig.legend(loc="upper left", bbox_to_anchor=(0,1), bbox_transform=ax.transAxes)
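
The driver code in this answer uses xs, ys and N_MOCK without defining them. A minimal sketch of mock input that could be defined before calling discretize_bisect, assuming a smooth signal between 0 and 1 with a little noise (the sample count, time range, waveform and noise level are all my own assumptions, not values from the answer):

```python
import numpy as np

N_MOCK = 1000  # assumed number of continuous samples
rng = np.random.default_rng(42)

# Mock time axis in seconds (range chosen arbitrarily)
xs = np.linspace(0.0, 10.0, N_MOCK)

# Smooth mock power curve in [0.1, 0.9] plus small Gaussian noise, clipped to [0, 1]
clean = 0.5 + 0.4 * np.sin(xs)
ys = np.clip(clean + rng.normal(0.0, 0.01, N_MOCK), 0.0, 1.0)

print(xs.shape, ys.shape)
```

Any pair of equally long 1-D arrays should work in place of this mock data.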

Finally, the Google Colab

python

numpy

discretization