可预测性上限
upper bound on predictability
我正在尝试计算占用数据集的可预测性上限,如 Song 的 'Limits of Predictability in Human Mobility' 论文中所述。基本上,在家(=1)和不在家(=0)就代表了 Song 论文中访问过的地点(塔)。
我在一个随机二进制序列上测试了我的代码(我从 https://github.com/gavin-s-smith/MobilityPredictabilityUpperBounds and https://github.com/gavin-s-smith/EntropyRateEst 派生的),该序列应该 return 熵为 1,可预测性为 0.5。相反,returned 熵为 0.87,可预测性为 0.71。
这是我的代码:
import numpy as np
from scipy.optimize import fsolve
from cmath import log
import math
def matchfinder(data):
data_len = len(data)
output = np.zeros(len(data))
output[0] = 1
# Using L_{n} definition from
#"Nonparametric Entropy Estimation for Stationary Process and Random Fields, with Applications to English Text"
# by Kontoyiannis et. al.
# $L_{n} = 1 + max \{l :0 \leq l \leq n, X^{l-1}_{0} = X^{-j+l-1}_{-j} \text{ for some } l \leq j \leq n \}$
# for each position, i, in the sub-sequence that occurs before the current position, start_idx
# check to see the maximum continuously equal string we can make by simultaneously extending from i and start_idx
for start_idx in range(1,data_len):
max_subsequence_matched = 0
for i in range(0,start_idx):
# for( int i = 0; i < start_idx; i++ )
# {
j = 0
#increase the length of the substring starting at j and start_idx
#while they are the same keeping track of the length
while( (start_idx+j < data_len) and (i+j < start_idx) and (data[i+j] == data[start_idx+j]) ):
j = j + 1
if j > max_subsequence_matched:
max_subsequence_matched = j;
#L_{n} is obtained by adding 1 to the longest match-length
output[start_idx] = max_subsequence_matched + 1;
return output
if __name__ == '__main__':
#Read dataset
data = np.random.randint(2,size=2000)
#Number of distinct locations
N = len(np.unique(data))
#True entropy
lambdai = matchfinder(data)
Etrue = math.pow(sum( [ lambdai[i] / math.log(i+1,2) for i in range(1,len(data))] ) * (1.0/len(data)),-1)
S = Etrue
#use Fano's inequality to compute the predictability
func = lambda x: (-(x*log(x,2).real+(1-x)*log(1-x,2).real)+(1-x)*log(N-1,2).real ) - S
ub = fsolve(func, 0.9)[0]
print ub
matchfinder 函数通过查找最长的匹配项来找到熵并将其加 1(= 以前未见过的最短子字符串)。然后使用 Fano 不等式计算可预测性。
可能是什么问题?
谢谢!
熵函数好像有误。
参考论文 Song, C., Qu, Z., Blumm, N., & Barabási, A. L. (2010)。人类流动性的可预测性限制。 Science, 327(5968), 1018–1021. 你提到了真实熵是通过基于Lempel-Ziv数据压缩的算法估计的:
在代码中它看起来像这样:
Etrue = math.pow((np.sum(lambdai)/ n),-1)*log(n,2).real
其中 n 是时间序列的长度。
请注意,我们使用的对数底数与给定公式中的底数不同。然而,由于 Fano 不等式中对数的底数是 2,因此使用相同的底数计算熵似乎是合乎逻辑的。另外,我不确定你为什么从第一个而不是零索引开始求和。
所以现在将其包装到函数中,例如:
def solve(locations, size):
data = np.random.randint(locations,size=size)
N = len(np.unique(data))
n = float(len(data))
print "Distinct locations: %i" % N
print "Time series length: %i" % n
#True entropy
lambdai = matchfinder(data)
#S = math.pow(sum([lambdai[i] / math.log(i + 1, 2) for i in range(1, len(data))]) * (1.0 / len(data)), -1)
Etrue = math.pow((np.sum(lambdai)/ n),-1)*log(n,2).real
S = Etrue
print "Maximum entropy: %2.5f" % log(locations,2).real
print "Real entropy: %2.5f" % S
func = lambda x: (-(x * log(x, 2).real + (1 - x) * log(1 - x, 2).real) + (1 - x) * log(N - 1, 2).real) - S
ub = fsolve(func, 0.9)[0]
print "Upper bound of predictability: %2.5f" % ub
return ub
2 个位置的输出
Distinct locations: 2
Time series length: 10000
Maximum entropy: 1.00000
Real entropy: 1.01441
Upper bound of predictability: 0.50013
3 个位置的输出
Distinct locations: 3
Time series length: 10000
Maximum entropy: 1.58496
Real entropy: 1.56567
Upper bound of predictability: 0.41172
Lempel-Ziv 压缩在 n 接近无穷大时收敛于真实熵,这就是为什么对于 2 个位置的情况它略高于最大限制。
我也不确定您是否正确解释了 lambda 的定义。它被定义为 "the length of the shortest substring starting at position i which dosen't previously appear from position 1 to i-1",所以当我们到达某个点时,进一步的子字符串不再是唯一的,你的匹配算法会给它的长度总是比子字符串的长度大一个,而它应该等于 0,因为不存在唯一子字符串。
为了更清楚地说明,我们举一个简单的例子。如果位置数组如下所示:
[1 0 0 1 0 0]
然后我们可以看到前三个位置模式再次重复。这意味着从第四个位置开始,最短的唯一子字符串 不存在 ,因此它等于 0。因此输出 (lambda) 应如下所示:
[1 1 2 0 0 0]
但是,您在这种情况下的功能将 return:
[1 1 2 4 3 2]
我重写了你的匹配函数来解决这个问题:
def matchfinder2(data):
data_len = len(data)
output = np.zeros(len(data))
output[0] = 1
for start_idx in range(1,data_len):
max_subsequence_matched = 0
for i in range(0,start_idx):
j = 0
end_distance = data_len - start_idx #length left to the end of sequence (including current index)
while( (start_idx+j < data_len) and (i+j < start_idx) and (data[i+j] == data[start_idx+j]) ):
j = j + 1
if j == end_distance: #check if j has reached the end of sequence
output[start_idx::] = np.zeros(end_distance) #if yes fill the rest of output with zeros
return output #end function
elif j > max_subsequence_matched:
max_subsequence_matched = j;
output[start_idx] = max_subsequence_matched + 1;
return output
差异当然很小,因为结果仅针对序列的一小部分发生变化。
我正在尝试计算占用数据集的可预测性上限,如 Song 的 'Limits of Predictability in Human Mobility' 论文中所述。基本上,在家(=1)和不在家(=0)就代表了 Song 论文中访问过的地点(塔)。
我在一个随机二进制序列上测试了我的代码(我从 https://github.com/gavin-s-smith/MobilityPredictabilityUpperBounds and https://github.com/gavin-s-smith/EntropyRateEst 派生的),该序列应该 return 熵为 1,可预测性为 0.5。相反,returned 熵为 0.87,可预测性为 0.71。
这是我的代码:
import numpy as np
from scipy.optimize import fsolve
from cmath import log
import math
def matchfinder(data):
data_len = len(data)
output = np.zeros(len(data))
output[0] = 1
# Using L_{n} definition from
#"Nonparametric Entropy Estimation for Stationary Process and Random Fields, with Applications to English Text"
# by Kontoyiannis et. al.
# $L_{n} = 1 + max \{l :0 \leq l \leq n, X^{l-1}_{0} = X^{-j+l-1}_{-j} \text{ for some } l \leq j \leq n \}$
# for each position, i, in the sub-sequence that occurs before the current position, start_idx
# check to see the maximum continuously equal string we can make by simultaneously extending from i and start_idx
for start_idx in range(1,data_len):
max_subsequence_matched = 0
for i in range(0,start_idx):
# for( int i = 0; i < start_idx; i++ )
# {
j = 0
#increase the length of the substring starting at j and start_idx
#while they are the same keeping track of the length
while( (start_idx+j < data_len) and (i+j < start_idx) and (data[i+j] == data[start_idx+j]) ):
j = j + 1
if j > max_subsequence_matched:
max_subsequence_matched = j;
#L_{n} is obtained by adding 1 to the longest match-length
output[start_idx] = max_subsequence_matched + 1;
return output
if __name__ == '__main__':
#Read dataset
data = np.random.randint(2,size=2000)
#Number of distinct locations
N = len(np.unique(data))
#True entropy
lambdai = matchfinder(data)
Etrue = math.pow(sum( [ lambdai[i] / math.log(i+1,2) for i in range(1,len(data))] ) * (1.0/len(data)),-1)
S = Etrue
#use Fano's inequality to compute the predictability
func = lambda x: (-(x*log(x,2).real+(1-x)*log(1-x,2).real)+(1-x)*log(N-1,2).real ) - S
ub = fsolve(func, 0.9)[0]
print ub
matchfinder 函数通过查找最长的匹配项来找到熵并将其加 1(= 以前未见过的最短子字符串)。然后使用 Fano 不等式计算可预测性。
可能是什么问题?
谢谢!
熵函数好像有误。 参考论文 Song, C., Qu, Z., Blumm, N., & Barabási, A. L. (2010)。人类流动性的可预测性限制。 Science, 327(5968), 1018–1021. 你提到了真实熵是通过基于Lempel-Ziv数据压缩的算法估计的:
在代码中它看起来像这样:
Etrue = math.pow((np.sum(lambdai)/ n),-1)*log(n,2).real
其中 n 是时间序列的长度。
请注意,我们使用的对数底数与给定公式中的底数不同。然而,由于 Fano 不等式中对数的底数是 2,因此使用相同的底数计算熵似乎是合乎逻辑的。另外,我不确定你为什么从第一个而不是零索引开始求和。
所以现在将其包装到函数中,例如:
def solve(locations, size):
data = np.random.randint(locations,size=size)
N = len(np.unique(data))
n = float(len(data))
print "Distinct locations: %i" % N
print "Time series length: %i" % n
#True entropy
lambdai = matchfinder(data)
#S = math.pow(sum([lambdai[i] / math.log(i + 1, 2) for i in range(1, len(data))]) * (1.0 / len(data)), -1)
Etrue = math.pow((np.sum(lambdai)/ n),-1)*log(n,2).real
S = Etrue
print "Maximum entropy: %2.5f" % log(locations,2).real
print "Real entropy: %2.5f" % S
func = lambda x: (-(x * log(x, 2).real + (1 - x) * log(1 - x, 2).real) + (1 - x) * log(N - 1, 2).real) - S
ub = fsolve(func, 0.9)[0]
print "Upper bound of predictability: %2.5f" % ub
return ub
2 个位置的输出
Distinct locations: 2
Time series length: 10000
Maximum entropy: 1.00000
Real entropy: 1.01441
Upper bound of predictability: 0.50013
3 个位置的输出
Distinct locations: 3
Time series length: 10000
Maximum entropy: 1.58496
Real entropy: 1.56567
Upper bound of predictability: 0.41172
Lempel-Ziv 压缩在 n 接近无穷大时收敛于真实熵,这就是为什么对于 2 个位置的情况它略高于最大限制。
我也不确定您是否正确解释了 lambda 的定义。它被定义为 "the length of the shortest substring starting at position i which dosen't previously appear from position 1 to i-1",所以当我们到达某个点时,进一步的子字符串不再是唯一的,你的匹配算法会给它的长度总是比子字符串的长度大一个,而它应该等于 0,因为不存在唯一子字符串。
为了更清楚地说明,我们举一个简单的例子。如果位置数组如下所示:
[1 0 0 1 0 0]
然后我们可以看到前三个位置模式再次重复。这意味着从第四个位置开始,最短的唯一子字符串 不存在 ,因此它等于 0。因此输出 (lambda) 应如下所示:
[1 1 2 0 0 0]
但是,您在这种情况下的功能将 return:
[1 1 2 4 3 2]
我重写了你的匹配函数来解决这个问题:
def matchfinder2(data):
data_len = len(data)
output = np.zeros(len(data))
output[0] = 1
for start_idx in range(1,data_len):
max_subsequence_matched = 0
for i in range(0,start_idx):
j = 0
end_distance = data_len - start_idx #length left to the end of sequence (including current index)
while( (start_idx+j < data_len) and (i+j < start_idx) and (data[i+j] == data[start_idx+j]) ):
j = j + 1
if j == end_distance: #check if j has reached the end of sequence
output[start_idx::] = np.zeros(end_distance) #if yes fill the rest of output with zeros
return output #end function
elif j > max_subsequence_matched:
max_subsequence_matched = j;
output[start_idx] = max_subsequence_matched + 1;
return output
差异当然很小,因为结果仅针对序列的一小部分发生变化。