Consensus/Cluster a set of variable length lists in Python?

Question

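I have a set of sensors measuring some temporal data. At any given time step, a sensor outputs either 0 or 1. A sensor never outputs two 1s in a row.

Given the available sensors, how can we find the best estimate?

For example, say four sensors output a 1 at the following indices.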

A = [ 178,  511,  843, 1180, 1512, 1733]
B = [ 514,  846, 1182, 1515, 1736, 1937]
C = [ 182,  516,  848, 1517, 1738, 1939]
D = [ 179,  513,  845, 1181, 1513, 1735, 1936, 2124]

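By eye, I can see that:

A is missing a value at the tail of the list
B is missing a value at the head of the list
C is missing a value in the middle of the list
D has an extra value at the tail of the list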

# the None locations are not known to the consensus algorithm
a = [  178,  511,  843, 1180, 1512, 1733, None]
b = [ None,  514,  846, 1182, 1515, 1736, 1937]
c = [  182,  516,  848, None, 1517, 1738, 1939]
d = [  179,  513,  845, 1181, 1513, 1735, 1936] # 2124 removed

# Consensus: Average over columns with `None` removed
# rounded to the nearest integer
s = consensus((A,B,C,D))
s = [  180,  514,  846, 1181, 1514, 1736, 1937]
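
The averaging rule in the comments above can be sketched as a column-wise mean that skips `None`. This is only a minimal sketch: it assumes the alignment is already known, which the real consensus algorithm would not be given. `column_consensus` is a hypothetical helper name, and rounding ties upward is an assumption.

```python
import math

def column_consensus(rows):
    """Average each column of equally long rows, ignoring None entries."""
    result = []
    for column in zip(*rows):
        present = [v for v in column if v is not None]
        if not present:
            continue  # no sensor reported this event
        mean = sum(present) / len(present)
        result.append(math.floor(mean + 0.5))  # round half up (assumption)
    return result

a = [178, 511, 843, 1180, 1512, 1733, None]
b = [None, 514, 846, 1182, 1515, 1736, 1937]
c = [182, 516, 848, None, 1517, 1738, 1939]
d = [179, 513, 845, 1181, 1513, 1735, 1936]

print(column_consensus((a, b, c, d)))
# [180, 514, 846, 1181, 1514, 1736, 1937]
```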

If we had two additional sensors, E and F, with the following values:

E = [ 2130 ]
F = [ 2121 ]
# these two sensors only have the one tail value
# therefore sensor D's extra reading is now part of consensus.
# All other values are unchanged.
s = consensus((A,B,C,D,E,F))
s = [  180,  514,  846, 1181, 1514, 1736, 1937, 2125]

Is there a non-O(n^2) way to solve this problem?

Answer 1

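Thanks to the two users in the comments, who guided me to a workable solution.

Edit: I'm hesitant to mark this as the final answer, because when we concatenate all the readings into one array we lose the information that each sensor is unique. I also think an iterative or dynamic programming approach could work, tracking the distance from each sensor's reading to the nearest value.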

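import numpy as np
from matplotlib import pyplot as plt
from sklearn.neighbors import KernelDensity
from scipy.signal import find_peaks

# A, B, C, D are the sensor index lists from the question.
# Pool every reading into a single column vector, as scikit-learn expects.
concat = A + B + C + D
X = np.array(concat)[:, np.newaxis]

# Evaluation grid for the density estimate.
X_plot = np.linspace(0, 1.1 * X.max(), 1000)[:, np.newaxis]

# Fit a kernel density estimate and locate its local maxima.
kde = KernelDensity(bandwidth=2).fit(X)
log_dens = kde.score_samples(X_plot)
dens = np.exp(log_dens)
peaks, _ = find_peaks(dens)

plt.plot(X_plot[:, 0], dens)
plt.plot(X_plot[peaks], dens[peaks], "X")
plt.show()

# Each density peak is taken as one consensus estimate.
print(tuple(int(i) for i in X_plot[peaks].squeeze()))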
# (180, 514, 846, 1181, 1513, 1735, 1936, 2123)
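
The concern in the edit above, that pooling the readings discards which sensor produced them, can be addressed by tagging each reading with its sensor before clustering. The following is a minimal sketch of one such approach, not the method used above: it sorts the tagged readings, starts a new cluster whenever the gap to the previous reading exceeds `tol`, and keeps only clusters seen by at least `min_votes` distinct sensors. Both parameters and their defaults are assumptions chosen for this example data, and ties are rounded upward. The sort dominates the cost, so the sketch runs in O(n log n).

```python
import math

def consensus(sensors, tol=10, min_votes=2):
    """Cluster pooled readings by gaps and average each well-supported cluster.

    tol and min_votes are illustrative assumptions, not values from the question.
    """
    # Tag every reading with the index of the sensor that produced it.
    tagged = sorted(
        (value, sid) for sid, readings in enumerate(sensors) for value in readings
    )
    clusters, current = [], []
    for value, sid in tagged:
        # A gap larger than tol starts a new event.
        if current and value - current[-1][0] > tol:
            clusters.append(current)
            current = []
        current.append((value, sid))
    if current:
        clusters.append(current)

    result = []
    for cluster in clusters:
        voters = {sid for _, sid in cluster}
        if len(voters) < min_votes:
            continue  # seen by too few distinct sensors
        mean = sum(value for value, _ in cluster) / len(cluster)
        result.append(math.floor(mean + 0.5))  # round half up (assumption)
    return result

A = [178, 511, 843, 1180, 1512, 1733]
B = [514, 846, 1182, 1515, 1736, 1937]
C = [182, 516, 848, 1517, 1738, 1939]
D = [179, 513, 845, 1181, 1513, 1735, 1936, 2124]
E = [2130]
F = [2121]

print(consensus((A, B, C, D)))
# [180, 514, 846, 1181, 1514, 1736, 1937]
print(consensus((A, B, C, D, E, F)))
# [180, 514, 846, 1181, 1514, 1736, 1937, 2125]
```

With only four sensors, the reading 2124 is dropped because a single sensor reported it. Once E and F are added, three sensors agree and it joins the consensus.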
