How to estimate similarity between sensor data based on the number of occurrences?

Here is my sample data:

data = {850.0: 6, -852.0: 5, 992.0: 29, -993.0: 25, 990.0: 27, -992.0: 28, 965.0: 127, 988.0: 37, -994.0: 24, 996.0: 14, -996.0: 19, -998.0: 19, 995.0: 17, 954.0: 71, -953.0: 64, 983.0: 48, 805.0: 20, 960.0: 97, 811.0: 23, 957.0: 98, 818.0: 9, -805.0: 10, -962.0: 128, 822.0: 5, 970.0: 115, 823.0: 6, 977.0: 86, 815.0: 11, 972.0: 118, -809.0: 3, -982.0: 77, 963.0: 129, 816.0: 15, 969.0: 131, 809.0: 13, -973.0: 115, 967.0: 141, 964.0: 110, 966.0: 141, -801.0: 11, -990.0: 33, 819.0: 8, 973.0: 113, -981.0: 71, 820.0: 16, 821.0: 10, -988.0: 42, 833.0: 7, 958.0: 92, -980.0: 98, 968.0: 138, -808.0: 5, -984.0: 57, 976.0: 108, 828.0: 3, -807.0: 6, 971.0: 134, -814.0: 3, 817.0: 13, -975.0: 112, 814.0: 12, 825.0: 6, 974.0: 90, -974.0: 125, -824.0: 2, -966.0: 131, -822.0: 4, 962.0: 108, -967.0: 121, -810.0: 3, 810.0: 11, 826.0: 7, 953.0: 74, -970.0: 140, -804.0: 6, -813.0: 2, 812.0: 18, 961.0: 126, -965.0: 159, -806.0: 5, 955.0: 74, -958.0: 93, -818.0: 6, 813.0: 18, 824.0: 6, 937.0: 25, -946.0: 51, -802.0: 8, 950.0: 48, -957.0: 91, 808.0: 11, 959.0: 116, -821.0: 3, -959.0: 108, 827.0: 4, -817.0: 4, 944.0: 47, -971.0: 126, -972.0: 104, -977.0: 96, 956.0: 92, 807.0: 10, 806.0: 21, 952.0: 60, 948.0: 51, 951.0: 67, 945.0: 47, -986.0: 37, 892.0: 13, 910.0: 23, 876.0: 6, -912.0: 18, 891.0: 8, 911.0: 22, -913.0: 13, 894.0: 7, 895.0: 12, 925.0: 15, 887.0: 6, 915.0: 16, 877.0: 7, 905.0: 14, 889.0: 7, -899.0: 10, 916.0: 17, -907.0: 11, -919.0: 17, 900.0: 20, 898.0: 9, 918.0: 16, 914.0: 18, 906.0: 18, 908.0: 17, -889.0: 7, 903.0: 16, 888.0: 5, -905.0: 9, -911.0: 19, 904.0: 20, -908.0: 12, 840.0: 2, -906.0: 16, 896.0: 11, -910.0: 17, -863.0: 3, 907.0: 27, -904.0: 10, -898.0: 13, 909.0: 19, -916.0: 20, 924.0: 24, 919.0: 20, -887.0: 6, 920.0: 12, 921.0: 12, 922.0: 15, 899.0: 14, -902.0: 9, -917.0: 12, 902.0: 14, 942.0: 46, 931.0: 23, 901.0: 22, -923.0: 14, -927.0: 15, 913.0: 18, -918.0: 16, 929.0: 22, 928.0: 13, -922.0: 7, -921.0: 16, 933.0: 22, 926.0: 13, 917.0: 18, 923.0: 16, 936.0: 24, 803.0: 30, -930.0: 10, 939.0: 33, -939.0: 24, 893.0: 8, 830.0: 5, 897.0: 8, 886.0: 8, -897.0: 4, -903.0: 12, -920.0: 9, -894.0: 3, -934.0: 14, 932.0: 23, -928.0: 16, 943.0: 40, 946.0: 45, 801.0: 17, -944.0: 35, 935.0: 23, 941.0: 30, -926.0: 11, -940.0: 38, 802.0: 16, 940.0: 43, -943.0: 38, -935.0: 24, 804.0: 23, -933.0: 9, -945.0: 36, 949.0: 56, 858.0: 2, -839.0: 3, -964.0: 108, -969.0: 111, -815.0: 2, 881.0: 3, -955.0: 74, -803.0: 3, 947.0: 50, -948.0: 57, -950.0: 58, -961.0: 133, -947.0: 43, -949.0: 54, -936.0: 20, 980.0: 75, -848.0: 3, -941.0: 27, -827.0: 5, -816.0: 7, -942.0: 37, 938.0: 29, -956.0: 81, -951.0: 59, -932.0: 11, -954.0: 71, -952.0: 64, -811.0: 3, 979.0: 89, -963.0: 128, -892.0: 4, -960.0: 109, 871.0: 4, 978.0: 85, -968.0: 136, 865.0: 1, -856.0: 3, 930.0: 11, 843.0: 5, -844.0: 1, -929.0: 24, -925.0: 19, -931.0: 11, 981.0: 65, 912.0: 19, 927.0: 10, -924.0: 8, -938.0: 25, 989.0: 31, -819.0: 4, 934.0: 16, -976.0: 92, -915.0: 14, 975.0: 92, 869.0: 5, 998.0: 9, 870.0: 1, -826.0: 2, 834.0: 2, 882.0: 5, 839.0: 4, 829.0: 3, 846.0: 2, -978.0: 117, -991.0: 39, -983.0: 59, -989.0: 48, 832.0: 4, 860.0: 5, -937.0: 25, 859.0: 1, 842.0: 5, -857.0: 4, -891.0: 8, 837.0: 4, -868.0: 3, -884.0: 4, 851.0: 4, 874.0: 8, 852.0: 6, 997.0: 14, -888.0: 3, 866.0: 6, -893.0: 6, -890.0: 6, 982.0: 45, 863.0: 2, 835.0: 3, -834.0: 3, -979.0: 73, 853.0: 3, 984.0: 44, -985.0: 30, 985.0: 36, 991.0: 25, 986.0: 35, -987.0: 29, 994.0: 24, 993.0: 29, -995.0: 16, -997.0: 17, -880.0: 4, -830.0: 3, 847.0: 1, 884.0: 4, -877.0: 5, -840.0: 1, -846.0: 2, 
-896.0: 8, -866.0: 2, -851.0: 2, -871.0: 2, -885.0: 3, -832.0: 3, -878.0: 1, 890.0: 6, 987.0: 22, -847.0: 2, 878.0: 5, 879.0: 3, 885.0: 5, 848.0: 2, 841.0: 5, 856.0: 3, 857.0: 4, 864.0: 1, 831.0: 5, 849.0: 3, 844.0: 3, 875.0: 3, 836.0: 3, 999.0: 6, -999.0: 6, -900.0: 7, 845.0: 2, 862.0: 1, 880.0: 4, 855.0: 2, -876.0: 1, -882.0: 2, -835.0: 2, -831.0: 5, -812.0: 1, -825.0: 2, -860.0: 3, -914.0: 12, -855.0: 5, -870.0: 5, -881.0: 4, -823.0: 3, -901.0: 5, -909.0: 15, -886.0: 2, 873.0: 3, -879.0: 1, -869.0: 4, -883.0: 4, -895.0: 8, 868.0: 3, -836.0: 2, 883.0: 4, -861.0: 2, -859.0: 2, -837.0: 1, -864.0: 2, -829.0: 2, -875.0: 4, -858.0: 2, -843.0: 1, -862.0: 1, -872.0: 2, 854.0: 2, -842.0: 1, -845.0: 3, -833.0: 1, -853.0: 3, 861.0: 3, -820.0: 2, -850.0: 2, -867.0: 2, -854.0: 1, -841.0: 3, 867.0: 1, -865.0: 3, -849.0: 2, 838.0: 1, -838.0: 1, -873.0: 1}

These are key/value pairs of a Python dictionary. The keys are the sensor data and the values are the number of occurrences. I need to find whether two key/value pairs match, as in the following example:

959.0: 116 and -959.0: 108

Here, the sensor data 959.0 and -959.0 are repeated (occur) 116 and 108 times, respectively. In my system, I can assume that 959.0 is good data. But this is not always the ideal case. The sensor data can be 958, -955, 952, etc. with their respective occurrence counts. I need to find the good sensor data in my database, such that each value has a similar opposite value and their occurrence counts are close.
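For a single pair, the check I have in mind looks roughly like this (the occurrence threshold of 10 is just a placeholder):

    # Sketch of the pairwise check; the threshold of 10 is only a placeholder
    def is_good_pair(data, key, occ_threshold=10):
        # `key` is good if its opposite value exists and their occurrence counts are close
        return -key in data and abs(data[key] - data[-key]) <= occ_threshold

    print(is_good_pair(data, 959.0))  # True: -959.0 exists and |116 - 108| <= 10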

My attempt:

At the moment I am solving it manually by plotting the data (x is the sensor data, y is the number of occurrences) and filtering horizontally and vertically. For example:

    import matplotlib.pyplot as plt

    # Vertical filter: keep a sensor value only if its opposite (-key) exists
    # and their occurrence counts differ by less than 2
    for key in list(data.keys()):
        if (-key) in data and abs(data[key] - data[-key]) < 2:
            pass
        else:
            del data[key]

    # Horizontal filter (based on the number of occurrences): keep values that
    # occur more than 20 times or whose magnitude exceeds 1000
    for key in list(data.keys()):
        if data[key] > 20 or abs(key) > 1000:
            pass
        else:
            del data[key]

    lists = sorted(data.items())  # sorted by key, returns a list of tuples
    x, y = zip(*lists)            # unpack a list of pairs into two tuples

    plt.plot(x, y, marker="*")
    plt.grid()
    plt.show()

Is there a better statistical way to solve my problem in Python? Thanks.

You can use the apply() method of a pandas DataFrame to compute the useful data, filtering the desired sensors with different precisions. Setting axis=1 in this method lets you define a function that operates on each row. For example, you can use an approach similar to the one you are performing manually:

  1. Fix a threshold for sensor similarity
  2. Fix a threshold for occurrence similarity
  3. Fix a threshold for the number of similar sensors + occurrences that a single data point must have in order to be considered valid

For example, the first step can be done like this:

import pandas as pd

# The data variable is the one provided in the example
# Prepare Pandas dataframe
data_dict = {"sensor": list(), "occ": list()}
for k, v in data.items():
    data_dict["sensor"].append(k)
    data_dict["occ"].append(v)
df = pd.DataFrame(data_dict)

# Add support column for filtering
df["ct"] = pd.NaT

# Choose sensor similarity threshold
threshold = 2

# Populate the column
df["ct"] = df.apply(lambda x: get_sensor_count(x, threshold, df), axis=1)

where the function get_sensor_count() is implemented as follows:

# Get the count of "similar" sensors
def get_sensor_count(row, threshold, df):
    # First check whether the sensors have a similar absolute value, then whether they have opposite signs
    similar = (abs(abs(df["sensor"]) - abs(row["sensor"])) < threshold) & (df["sensor"] * row["sensor"] < 0)
    return df[similar]["sensor"].count()

In this way you can set the sensor similarity threshold and obtain the number of similar sensors. To filter out the sensors that have no similar opposite value, you can do the following:

# If there is at least one similar sensor, keep it
df_good_sensors = df[df["ct"] > 0]

After that you can add any filters you want on this dataset, for example the one from your example:

# Filter occurrences
df_good_occ = df_good_sensors[(df_good_sensors["occ"] > 20) | (df_good_sensors["sensor"].abs() > 1000)]

Now you can check which sensors measured similar events by setting a new threshold for this portion of the data:

# Choose occurrences similarity threshold
o_threshold = 5
df_good_occ["ct"] = pd.NaT
df_good_occ["ct"] = df_good_occ.apply(lambda x: get_occ_count(x, threshold, o_threshold, df_good_occ), axis=1)

where the get_occ_count() function is implemented as follows:

def get_occ_count(row, s_threshold, o_threshold, df):
    # Get similar sensors using the previous sensor threshold
    to_check = df[(abs(abs(df["sensor"]) - abs(row["sensor"])) < s_threshold) & (df["sensor"] * row["sensor"] < 0)]
    # Count only the sensors whose occurrence count is similar to the current sensor's
    return to_check[abs(to_check["occ"] - row["occ"]) < o_threshold]["sensor"].count()

Now, for each sensor, you have the number of opposite values with a similar number of occurrences. As a final filter, you can set how many similar data points each final point must have:

# Choose how many similar sensors a point needs in order to be kept
count_threshold = 2
df_final = df_good_occ[df_good_occ["ct"] > count_threshold]

# Drop support column
df_final = df_final.drop(["ct"], axis=1)

With this approach you can tune 3 parameters:

  • the sensor threshold
  • the occurrence threshold
  • the number of similar data points

You can play with these 3 variables and see which combination gives you better results. To test this, you can follow the procedure below (a sketch of it follows the list):

  • generate the 3 variables
  • use a dataset for which you already know which data points must be kept
  • check the percentage of data points that were correctly kept
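A minimal sketch of that procedure, assuming a known_good set of sensor values that should survive the filtering (purely hypothetical here) and a compact run_pipeline() stand-in for the pandas steps above:

from itertools import product

def run_pipeline(data, s_threshold, o_threshold, count_threshold):
    # Simplified stand-in for the pandas pipeline above, working on the raw dict
    kept = []
    for key, occ in data.items():
        similar = [k for k in data
                   if abs(abs(k) - abs(key)) < s_threshold
                   and k * key < 0
                   and abs(data[k] - occ) < o_threshold]
        if len(similar) > count_threshold:
            kept.append(key)
    return kept

# Hypothetical ground truth: sensor values you already know must be kept
known_good = {959.0, -959.0, 965.0, -965.0}

best = None
for s_t, o_t, c_t in product([1, 2, 3], [3, 5, 10], [1, 2, 3]):
    kept = set(run_pipeline(data, s_t, o_t, c_t))
    score = len(kept & known_good) / len(known_good)  # fraction of known-good points kept
    if best is None or score > best[0]:
        best = (score, s_t, o_t, c_t)

print("best score %.2f with thresholds %s" % (best[0], best[1:]))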

If I understand correctly, you want to compare these two time series coming from the sensor and then do some analysis.

But it's not always the ideal case. The sensor data can be 958, -955, 952, etc with their respective occurrence number.

And this sentence suggests that the data may contain statistical error.

Plotting these time series first can help you choose a good approach.

The negative data is shown in orange and the positive data in blue.

from scipy.signal import savgol_filter
from matplotlib import pyplot as plt
import numpy as np
import seaborn as sns

# `data` is the dictionary from the question: {sensor value: occurrence count}
arr = np.array(list(data.items()), dtype=int)
positive = np.zeros(np.abs(arr[:, 0]).max() + 1, dtype=int)
negative = np.zeros_like(positive)
positive[arr[arr[:, 0] > 0, 0]] = arr[arr[:, 0] > 0, 1]
negative[-arr[arr[:, 0] < 0, 0]] = arr[arr[:, 0] < 0, 1]
sns.lineplot(x=np.arange(len(positive)), y=savgol_filter(positive, 11, 3))
sns.lineplot(x=np.arange(len(positive)), y=savgol_filter(-negative, 11, 3))
plt.show()

You can see the difference; the statistical error depends on the value.

We could try adding a filter such as a Gaussian filter, but here I prefer the Savgol filter.

You can use scipy:

from scipy.signal import savgol_filter
savgol_filter(negative, 11, 3)
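For comparison, a Gaussian smoothing of the same series could look like this (a minimal sketch; the sigma value is just an assumption):

from scipy.ndimage import gaussian_filter1d

# Gaussian smoothing of the negative series; sigma=3 is an arbitrary choice
negative_gauss = gaussian_filter1d(negative.astype(float), sigma=3)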

Here is the difference.
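A minimal sketch to visualize that difference (raw versus Savgol-smoothed series), reusing the negative array built above:

from scipy.signal import savgol_filter
from matplotlib import pyplot as plt
import numpy as np

# Compare the raw negative series with its Savgol-smoothed version
x = np.arange(len(negative))
plt.plot(x, negative, label="raw")
plt.plot(x, savgol_filter(negative, 11, 3), label="savgol(11, 3)")
plt.legend()
plt.show()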