Calculating Incremental Entropy for Data That Is Not Real Numbers
I have a set of data containing IDs, timestamps, and identifiers. I have to go through it, calculating the entropy and saving some other links for the data. At each step more identifiers are added to the identifiers dictionary and I have to re-compute the entropy and append it. I have a huge amount of data, and the program gets stuck because of the growing number of identifiers and the entropy calculation after each step. I read the following solution, but it is about data consisting of numbers:
Incremental entropy computation
I copied the two functions from that page, but the incremental entropy calculation gives a different value at every step than the classical full entropy calculation. Here is my code:
from math import log

# ---------------------------------------------------------------------#
# Functions copied from the linked answer
# maps x to -x*log2(x) for x > 0, and to 0 otherwise
h = lambda p: -p*log(p, 2) if p > 0 else 0

# entropy of the union of two samples with entropies H1, H2 and sizes S1, S2
def update(H1, S1, H2, S2):
    S = S1 + S2
    return 1.0*H1*S1/S + h(1.0*S1/S) + 1.0*H2*S2/S + h(1.0*S2/S)

# compute entropy using the classic equation
def entropy(L):
    n = 1.0*sum(L)
    return sum([h(x/n) for x in L])
# ---------------------------------------------------------------------#
# Below is the input data (actually I read it from a csv file)
input_data = [["1","2008-01-06T02:13:38Z","foo,bar"], ["2","2008-01-06T02:12:13Z","bar,blup"], ["3","2008-01-06T02:13:55Z","foo,bar"],
              ["4","2008-01-06T02:12:28Z","foo,xy"], ["5","2008-01-06T02:12:44Z","foo,bar"], ["6","2008-01-06T02:13:00Z","foo,bar"],
              ["7","2008-01-06T02:13:00Z","x,y"]]

total_identifiers = {}   # To store the occurrences of identifiers; values show the number of occurrences
all_entropies = []       # Classical way of calculating entropy at every step
updated_entropies = []   # Incremental way of calculating entropy at every step

for item in input_data:
    temp = item[2].split(",")
    identifiers_sum = sum(total_identifiers.values())  # Sum of all identifiers seen so far
    old_entropy = 0 if all_entropies[-1:] == [] else all_entropies[-1]  # Get previous entropy calculation
    for identifier in temp:
        S_new = len(temp)                       # size of the new sample
        temp_dictionary = {a: 1 for a in temp}  # Store current identifiers and their occurrence
        if identifier not in total_identifiers:
            total_identifiers[identifier] = 1
        else:
            total_identifiers[identifier] += 1

    current_entropy = entropy(total_identifiers.values())  # Entropy for the current set of identifiers
    updated_entropy = update(old_entropy, identifiers_sum, current_entropy, S_new)
    updated_entropies.append(updated_entropy)

    entropy_value = entropy(total_identifiers.values())  # Classical entropy calculation for comparison; becomes too expensive with big data
    all_entropies.append(entropy_value)

print(total_identifiers)
print('Sum of Total Identifiers: ', identifiers_sum)  # Gives 12 while the sum is 14 ???
print("All Classical Entropies: ", all_entropies)     # print for comparison
print("All Updated Entropies: ", updated_entropies)
Another problem: when I print "Sum of Total Identifiers", it gives 12 instead of 14! (Since the data is huge, I read the actual file line by line, writing the results directly to disk; nothing is kept in memory besides the identifiers dictionary.)
The code above uses Theorem 4 of the paper cited below; it seems to me that you want to use Theorem 5 instead: Theorem 4 combines two samples over disjoint sets of labels, whereas Theorem 5 updates the entropy of a single sample whose counts change, which is what happens here because the same identifiers keep recurring.
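To see the mismatch concretely, here is a minimal sketch (reusing h, update, and entropy from the question). Theorem 4 is only exact when the two samples are over disjoint labels, but consecutive batches of identifiers overlap:

# Two batches that share the label "foo", e.g. "foo,bar" followed by "foo,xy"
batch1 = [1, 1]     # counts within the first batch  (foo, bar)
batch2 = [1, 1]     # counts within the second batch (foo, xy)
merged = [2, 1, 1]  # true merged counts: foo=2, bar=1, xy=1

H1, H2 = entropy(batch1), entropy(batch2)
print(update(H1, 2, H2, 2))  # 2.0 -- correct only for disjoint batches
print(entropy(merged))       # 1.5 -- the actual entropy of the combined data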
Note, however, that if the number of identifiers really is the problem, then the incremental approach below will not work either; at some point the dictionary will grow too large.
Below you can find a proof-of-concept Python implementation that follows the description in Updating Formulas and Algorithms for Computing Entropy and Gini Index from Time-Changing Data Streams.
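In symbols, a single call to update below turns the old entropy H (over total count S, with per-label counts c_i) into the new entropy H' after a batch of signed count changes \delta_i with r = \sum_i \delta_i:

H' = \frac{S}{S+r}\Bigl(H - \log_2\frac{S}{S+r}\Bigr)
     - \sum_{i \in \text{changed}} \Bigl[\frac{c_i+\delta_i}{S+r}\log_2\frac{c_i+\delta_i}{S+r}
     - \frac{c_i}{S+r}\log_2\frac{c_i}{S+r}\Bigr]

The first term rescales every old term of the entropy to the new denominator S+r; the residual sum then swaps the old terms of the changed labels for their new values.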
import collections
import math
import random


def log2(p):
    return math.log(p, 2) if p > 0 else 0

# A single count update: `label` is the category, `change` the (signed) delta
CountChange = collections.namedtuple('CountChange', ('label', 'change'))


class EntropyHolder:
    def __init__(self):
        self.counts_ = collections.defaultdict(int)  # per-label counts
        self.entropy_ = 0                            # current entropy
        self.sum_ = 0                                # total count

    def update(self, count_changes):
        r = sum([change for _, change in count_changes])  # total change in count
        residual = self._compute_residual(count_changes)
        # Rescale the old entropy to the new total, then correct the changed labels
        self.entropy_ = self.sum_ * (self.entropy_ - log2(self.sum_ / (self.sum_ + r))) / (self.sum_ + r) - residual
        self._update_counts(count_changes)
        return self.entropy_

    def _compute_residual(self, count_changes):
        r = sum([change for _, change in count_changes])
        residual = 0
        for label, change in count_changes:
            p_new = (self.counts_[label] + change) / (self.sum_ + r)
            p_old = self.counts_[label] / (self.sum_ + r)
            residual += p_new * log2(p_new) - p_old * log2(p_old)
        return residual

    def _update_counts(self, count_changes):
        for label, change in count_changes:
            self.sum_ += change
            self.counts_[label] += change

    def entropy(self):
        return self.entropy_


def naive_entropy(counts):
    # Classical O(n) entropy computation, for comparison
    s = sum(counts)
    return sum([-(r/s) * log2(r/s) for r in counts])


if __name__ == '__main__':
    print(naive_entropy([1, 1]))        # 1.0
    print(naive_entropy([1, 1, 1, 1]))  # 2.0

    entropy = EntropyHolder()
    freq = collections.defaultdict(int)
    for _ in range(100):
        index = random.randint(0, 5)
        entropy.update([CountChange(index, 1)])
        freq[index] += 1

    print(naive_entropy(freq.values()))  # classical result...
    print(entropy.entropy())             # ...matches the incremental one
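A small usage note: nothing in the update requires change to be positive, so (assuming a label's count never goes below zero) the same holder also supports removing observations, e.g. for a sliding window over the stream:

eh = EntropyHolder()
eh.update([CountChange('foo', 1), CountChange('bar', 1)])
print(eh.entropy())                  # 1.0 -- entropy of {foo: 1, bar: 1}
eh.update([CountChange('foo', -1)])  # "foo" leaves the window
print(eh.entropy())                  # 0.0 -- entropy of {bar: 1}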
Thanks to @blazs for providing the EntropyHolder class; this solves the problem. The idea is to import entropy_holder.py (from https://gist.github.com/blazs/4fc78807a96976cc455f49fc0fb28738), use it to keep the running entropy, and update it at every step as new identifiers come in.
The minimal working code then looks like this:
import entropy_holder

input_data = [["1","2008-01-06T02:13:38Z","foo,bar"], ["2","2008-01-06T02:12:13Z","bar,blup"], ["3","2008-01-06T02:13:55Z","foo,bar"],
              ["4","2008-01-06T02:12:28Z","foo,xy"], ["5","2008-01-06T02:12:44Z","foo,bar"], ["6","2008-01-06T02:13:00Z","foo,bar"],
              ["7","2008-01-06T02:13:00Z","x,y"]]

entropy = entropy_holder.EntropyHolder()  # Holds the current entropy and the counts of identifiers
for item in input_data:
    for identifier in item[2].split(","):
        entropy.update([entropy_holder.CountChange(identifier, 1)])

print(entropy.entropy())
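As a quick sanity check, the incremental value can be compared with the classical computation over the full counts (a sketch, assuming the naive_entropy helper from the proof-of-concept above is in scope):

from collections import Counter

counts = Counter(i for item in input_data for i in item[2].split(","))
print(naive_entropy(counts.values()))  # should closely match entropy.entropy()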
The entropy obtained with Blaz's incremental formula is very close to the entropy computed the classical way, and it avoids iterating over all the data again and again.