Consensus/Cluster a set of variable length lists in Python?

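I have a set of sensors measuring some temporal data. At any given time step, a sensor outputs either 0 or 1. A sensor never outputs two 1s in a row.

Given the sensors that are available, how can we find the best estimate?

For example, suppose four sensors output a 1 at the following indices: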

A = [ 178,  511,  843, 1180, 1512, 1733]
B = [ 514,  846, 1182, 1515, 1736, 1937]
C = [ 182,  516,  848, 1517, 1738, 1939]
D = [ 179,  513,  845, 1181, 1513, 1735, 1936, 2124]

By eye, I can see that the readings line up like this:

# the None locations are not known to the consensus algorithm
a = [  178,  511,  843, 1180, 1512, 1733, None]
b = [ None,  514,  846, 1182, 1515, 1736, 1937]
c = [  182,  516,  848, None, 1517, 1738, 1939]
d = [  179,  513,  845, 1181, 1513, 1735, 1936] # 2124 removed

# Consensus: Average over columns with `None` removed
# rounded to the nearest integer
s = consensus((A,B,C,D))
s = [  180,  514,  846, 1181, 1514, 1736, 1937]
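
To make the averaging step concrete, here is a minimal sketch that assumes the rows have already been aligned by hand, as in the lowercase lists above. The name `column_consensus` is made up for illustration, and it does not solve the alignment problem itself, which is the hard part of the question.

```python
def column_consensus(rows):
    """Average each column of pre-aligned rows, skipping None, then round."""
    result = []
    for column in zip(*rows):
        known = [v for v in column if v is not None]
        if known:  # ignore columns where no sensor reported anything
            result.append(round(sum(known) / len(known)))
    return result


# Hand-aligned readings from the question (None marks a missing reading)
a = [178, 511, 843, 1180, 1512, 1733, None]
b = [None, 514, 846, 1182, 1515, 1736, 1937]
c = [182, 516, 848, None, 1517, 1738, 1939]
d = [179, 513, 845, 1181, 1513, 1735, 1936]

print(column_consensus([a, b, c, d]))
```

Note that Python's built-in `round` rounds exact halves to the nearest even integer, which matters for any column whose mean ends in .5.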

If we had two additional sensors, E and F, with the following values:

E = [ 2130 ]
F = [ 2121 ]
# these two sensors only have the one tail value
# therefore sensor D's extra reading is now part of consensus.
# All other values are unchanged.
s = consensus((A,B,C,D,E,F))
s = [  180,  514,  846, 1181, 1514, 1736, 1937, 2125]

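Is there a way to solve this that is not O(n^2)?

Thanks to the two users in the comments, who pointed me toward a workable solution.

Edit: I'm hesitant to mark this as the final answer, because when we concatenate all the readings into a single array we lose the information that each sensor is distinct. I also think an iterative or dynamic programming approach could work, one that tracks the distance from each sensor's reading to the nearest value.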

import numpy as np
from matplotlib import pyplot as plt
from sklearn.neighbors import KernelDensity
from scipy.signal import find_peaks

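# Pool the readings from all sensors (A, B, C, D as defined in the question)
concat = A + B + C + D
X = np.array(concat)[:, np.newaxis]

# Evaluation grid for the density estimate
X_plot = np.linspace(0, 1.1 * X.max(), 1000)[:, np.newaxis]

# Fit a kernel density estimate and evaluate it on the grid
kde = KernelDensity(bandwidth=2).fit(X)
log_dens = kde.score_samples(X_plot)
dens = np.exp(log_dens)

# Local maxima of the density serve as the consensus estimates
peaks, _ = find_peaks(dens)

plt.plot(X_plot[:, 0], dens)
plt.plot(X_plot[peaks, 0], dens[peaks], "X")
plt.show()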

print(tuple(int(i) for i in X_plot[peaks].squeeze()))
# (180, 514, 846, 1181, 1513, 1735, 1936, 2123)
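
On the concern in the edit above about losing sensor identity: one possible alternative, sketched here under assumptions rather than tested beyond the example data, is to tag each reading with its sensor index, merge the already-sorted lists, and split into clusters wherever the gap to the previous reading is large. The function name `gap_consensus` and the parameters `max_gap` and `min_sensors` are hypothetical; the rule "keep a cluster only if at least two distinct sensors contribute" is my reading of the example, where 2124 only counts once E and F are present.

```python
from heapq import merge


def gap_consensus(sensors, max_gap=10, min_sensors=2):
    """Cluster merged readings by gap; keep clusters seen by enough sensors."""
    # Tag each reading with its sensor index. Each list is already sorted,
    # so heapq.merge yields all readings in order in O(n log k) for k sensors.
    tagged = merge(*[[(value, idx) for value in readings]
                     for idx, readings in enumerate(sensors)])

    clusters, current = [], []
    for value, idx in tagged:
        # Start a new cluster when the gap to the previous reading is too big
        if current and value - current[-1][0] > max_gap:
            clusters.append(current)
            current = []
        current.append((value, idx))
    if current:
        clusters.append(current)

    result = []
    for cluster in clusters:
        distinct_sensors = {idx for _, idx in cluster}
        if len(distinct_sensors) >= min_sensors:
            values = [v for v, _ in cluster]
            result.append(round(sum(values) / len(values)))
    return result


A = [178, 511, 843, 1180, 1512, 1733]
B = [514, 846, 1182, 1515, 1736, 1937]
C = [182, 516, 848, 1517, 1738, 1939]
D = [179, 513, 845, 1181, 1513, 1735, 1936, 2124]
E = [2130]
F = [2121]

print(gap_consensus([A, B, C, D]))
print(gap_consensus([A, B, C, D, E, F]))
```

With these defaults on the lists from the question, the lone 2124 is dropped for A through D and is kept, averaged with E and F, once those two sensors are included. The sketch does not enforce that a sensor contributes at most one reading per cluster, and `max_gap` has to be chosen well below the spacing between true events, so treat it as a starting point only.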