Is a sample size of 1 considered reservoir sampling?
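I just want to know whether my code counts as reservoir sampling. I have a stream of pageviews to process, and I handle one pageview at a time. However, since most of the pageviews are the same, I only want to pick a pageview at random (still processing one at a time). For example, my pageviews look like this: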
[www.example.com, www.example.com, www.example1.com, www.example3.com, ...]
I process one element at a time. Here is my code:
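    import random

    # The class line was missing from the snippet; PageviewSampler is a placeholder name.
    class PageviewSampler(object):
        def __init__(self):
            self.counter = 0  # number of pageviews seen so far

        def processable(self):
            # Returns True with probability 1/counter (always True for the first pageview).
            self.counter += 1
            return random.random() < 1.0 / self.counter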
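Following the reservoir sampling algorithm (described at https://en.wikipedia.org/wiki/Reservoir_sampling), we store only a single pageview (reservoir size = 1). The following implementation shows how this probabilistic selection strategy leads to a uniform selection probability over the streamed pageviews: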
    import numpy as np
    import matplotlib.pyplot as plt

    max_num = 10  # maximum number of pageviews we want to consider
    # replicate the experiment ntrials times and find the probability for selection of any pageview
    pageview_indices = []
    ntrials = 10000
    for _ in range(ntrials):
        pageview_index = None  # index of the single pageview to be kept
        for i in range(1, max_num + 1):  # streaming pageviews
            # keep the first pageview; from the next one onwards, replace the kept
            # pageview with probability 1/i (retain it with probability 1 - 1/i)
            if i == 1:
                pageview_index = 1
            else:
                pageview_index = np.random.choice([pageview_index, i], 1, p=[1 - 1. / i, 1. / i])[0]
            # print('pageview chosen:', pageview_index)
        print('Final pageview chosen:', pageview_index)
        pageview_indices.append(pageview_index)

    # one unit-width bin per index, so the normalized bar heights are probabilities
    bins = np.arange(0.5, max_num + 1.5)
    plt.hist(pageview_indices, bins, density=True, facecolor='green', alpha=0.75)
    plt.xlabel('Pageview Index')
    plt.ylabel('Probability Chosen')
    plt.title('Reservoir Sampling')
    plt.axis([0, max_num + 1, 0, 0.15])
    plt.xticks(range(1, max_num + 1))
    plt.grid(True)
    plt.show()
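As the histogram shows, each pageview index is selected with nearly uniform probability (about 1/10 for each of the 10 pageviews), and it can also be proven mathematically that the selection is exactly uniform.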
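For reference, here is a minimal sketch of a size-1 reservoir that works on the pageview URLs themselves rather than on indices. The function name `reservoir_sample_one`, the `rng` parameter and the fixed seed are illustrative assumptions, not part of the question's code. The key difference from calling `processable()` on every element is that the sampler only remembers the most recently accepted pageview, and the uniform guarantee applies to whatever is held once the stream ends.

```python
import random
from collections import Counter


def reservoir_sample_one(stream, rng=random):
    """Return one item picked uniformly at random from an iterable of unknown length.

    Returns None if the stream is empty.
    """
    kept = None
    for count, item in enumerate(stream, start=1):
        # Replace the kept item with probability 1/count.
        # For count == 1 the test always passes, so the first item is always kept.
        if rng.random() < 1.0 / count:
            kept = item
    return kept


pageviews = ["www.example.com", "www.example.com", "www.example1.com", "www.example3.com"]

rng = random.Random(0)  # fixed seed (an assumption) so repeated runs match
trials = 20000
counts = Counter(reservoir_sample_one(pageviews, rng) for _ in range(trials))
for url, n in sorted(counts.items()):
    print(url, round(n / float(trials), 3))
```

Each position in the stream is equally likely to be the final pick: the item at position i is accepted with probability 1/i and then survives every later step j with probability (j-1)/j, so its overall probability is (1/i) x (i/(i+1)) x ... x ((n-1)/n) = 1/n. Because `www.example.com` occupies two of the four positions here, it should come up about twice as often as each of the other URLs. If the goal is to treat identical pageviews as one, the stream would need to be deduplicated first, which is a separate step from the sampling.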