简单的基数估计算法
Simple cardinality estimation algorithm
有HyperLogLog算法,但是比较复杂
有没有更简单的space有效的方法,可以用几行代码来表达?
HyperLogLog 本身很简单,一旦你理解它(并且已经有了哈希函数)。
我在 Python 中实现了一个教学版本,其中 5 行用于元素插入,另外 6 行用于估计:
from mmhash import mmhash
from math import log
from base64 import b64encode
class HyperLogLog:
def __init__(self, log2m):
self.log2m = log2m
self.m = 1 << log2m
self.data = [0]*self.m
self.alphaMM = (0.7213 / (1 + 1.079 / self.m)) * self.m * self.m
def offer(self, o):
x = mmhash(str(o), 0)
a, b = 32-self.log2m, self.log2m
i = x >> a
v = self._bitscan(x << b, a)
self.data[i] = max(self.data[i], v)
def count(self):
estimate = self.alphaMM / sum([2**-v for v in self.data])
if estimate <= 2.5 * self.m:
zeros = float(self.data.count(0))
return round(-self.m * log(zeros / self.m))
else:
return round(estimate)
def _bitscan(self, x, m):
v = 1
while v<=m and not x&0x80000000:
v+=1
x<<=1
return v
完整的实现可以在这里找到:
https://github.com/juanplopes/hyperloglog.py/blob/master/hyperloglog.py
有HyperLogLog算法,但是比较复杂
有没有更简单的space有效的方法,可以用几行代码来表达?
HyperLogLog 本身很简单,一旦你理解它(并且已经有了哈希函数)。
我在 Python 中实现了一个教学版本,其中 5 行用于元素插入,另外 6 行用于估计:
from mmhash import mmhash
from math import log
from base64 import b64encode
class HyperLogLog:
def __init__(self, log2m):
self.log2m = log2m
self.m = 1 << log2m
self.data = [0]*self.m
self.alphaMM = (0.7213 / (1 + 1.079 / self.m)) * self.m * self.m
def offer(self, o):
x = mmhash(str(o), 0)
a, b = 32-self.log2m, self.log2m
i = x >> a
v = self._bitscan(x << b, a)
self.data[i] = max(self.data[i], v)
def count(self):
estimate = self.alphaMM / sum([2**-v for v in self.data])
if estimate <= 2.5 * self.m:
zeros = float(self.data.count(0))
return round(-self.m * log(zeros / self.m))
else:
return round(estimate)
def _bitscan(self, x, m):
v = 1
while v<=m and not x&0x80000000:
v+=1
x<<=1
return v
完整的实现可以在这里找到:
https://github.com/juanplopes/hyperloglog.py/blob/master/hyperloglog.py