Given n samples from a uniform distribution [0,d], how would you estimate d?
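I think there are two ways to approach this.
One is to take the MAX of the sample set; the other is to take 2 x the sample mean.
I found a solution online that simulates these distributions to compare the two, but it is written in an unusual way (the for clauses come after the expression they apply to). I tried to rewrite it, but something is wrong with my code: it doesn't seem to run the function multiple times and compare the results as the sample size increases. Any help is appreciated.
My code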
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def sample_random_normal(n = 100):
    for i in range(1,100):
        for j in [np.random.uniform(0, n, size = i).astype(int)]:
            return np.array([np.array([max(j), 2*np.mean(j)])])

def repeat_experiment():
    for _ in range(1,100):
        experiments = np.array([sample_random_normal()])
    return experiments.mean(axis = 0)

result = repeat_experiment()
df = pd.DataFrame(result)
df.columns = ['max_value', '2*mean']
df['k'] = pd.Series(range(1,100))
df['actual_value'] = 100
df['max_value-actual-value'] = df['max_value'] - df['actual_value']
df['2*mean-actual_value'] = df['2*mean'] - df['actual_value']
plt.plot(df['k'], df['max_value'], linestyle = 'solid', label = 'max_value_estimate')
plt.plot(df['k'], df['2*mean'], linestyle = 'dashed', label = '2*mean estimate')
plt.plot(df['k'], df['max_value-actual-value'], linestyle = 'solid', label = 'max_value_estimate')
plt.plot(df['k'], df['2*mean-actual_value'], linestyle = 'dashed', label = '2*mean estimate')
plt.legend()
plt.show()
Original code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def sample_random_normal(n = 100):
    return np.array([np.array([max(j), 2*np.mean(j)]) for j in [np.random.uniform(0, n, size=i).astype(int) for i in range(1, 100)]])

def repeat_experiment():
    experiments = np.array([sample_random_normal() for _ in range(100)])
    return experiments.mean(axis = 0)

result = repeat_experiment()
df = pd.DataFrame(result)
df.columns = ['max_value', '2*mean']
df['k'] = range(1, 100)
df['actual_value'] = 100
df['max_value-actual-value'] = df['max_value'] - df['actual_value']
df['2*mean-actual-value'] = df['2*mean'] - df['actual_value']
plt.plot(df['k'], df['max_value'], linestyle='solid', label='max_value_estimate')
plt.plot(df['k'], df['2*mean'], linestyle='dashed', label ='2*mean estimate')
plt.legend()
plt.show()
Look here:
def sample_random_normal(n = 100):
    for i in range(1,100):
        for j in [np.random.uniform(0, n, size = i).astype(int)]:
            return np.array([np.array([max(j), 2*np.mean(j)])])
On the very first i and j, the function reaches the return statement and exits, so only the sample of size 1 is ever evaluated. Corrected to:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def sample_random_normal(n = 100):
    # One sample for each size i = 1..99, with draws truncated to integers
    samples = [np.random.uniform(0, n, size = i).astype(int) for i in range(1,100)]
    # One row per sample size: [max estimate, 2*mean estimate]
    return np.array([np.array([max(j), 2*np.mean(j)]) for j in samples])

def repeat_experiment():
    # Repeat the whole sweep 100 times and average the estimates per sample size
    experiments = np.array([sample_random_normal() for _ in range(100)])
    return experiments.mean(axis = 0)

result = repeat_experiment()
df = pd.DataFrame(result)
df.columns = ['max_value', '2*mean']
df['k'] = pd.Series(range(1,100))
df['actual_value'] = 100
df['max_value-actual-value'] = df['max_value'] - df['actual_value']
df['2*mean-actual_value'] = df['2*mean'] - df['actual_value']
plt.plot(df['k'], df['max_value'], linestyle = 'solid', label = 'max_value_estimate')
plt.plot(df['k'], df['2*mean'], linestyle = 'dashed', label = '2*mean estimate')
plt.plot(df['k'], df['max_value-actual-value'], linestyle = 'solid', label = 'max_value-actual-value')
plt.plot(df['k'], df['2*mean-actual_value'], linestyle = 'dashed', label = '2*mean-actual_value')
plt.legend()
plt.show()
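The result is:
Your simulation illustrates that both estimators are consistent. Note, however, that the max estimator is biased, whereas 2 x the mean is unbiased. This is really more of a math/statistics question, though; if you're interested, see this question from math.stackexchange.
I also fixed your legend labels, since they were wrong before.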
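For readers who want the bias claim spelled out, here is a short derivation. It assumes the idealized continuous case (i.i.d. draws from a uniform distribution on [0, d], without the integer truncation used in the code); the symbols M for the sample maximum and \bar{X} for the sample mean are notation introduced here, not taken from the question.

```latex
\begin{align}
P(M \le x) &= \left(\frac{x}{d}\right)^{n}, \qquad 0 \le x \le d, \\
E[M] &= \int_0^d x \,\frac{n x^{n-1}}{d^{n}}\,dx = \frac{n}{n+1}\,d \;<\; d, \\
E[2\bar{X}] &= 2\,E[X_1] = 2 \cdot \frac{d}{2} = d.
\end{align}
```

So the max underestimates d by d/(n+1) on average, a gap that shrinks as n grows, while twice the mean is centered on d for every n.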
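There is a third solution that beats both of the ones you proposed. As pavel pointed out in the comments, the Allies used it to estimate how many tanks the Germans were producing in World War II. Using the observed maximum gives a biased estimate, while doubling the mean has high variability: scaling a random variable (the sample mean) by 2 quadruples its variance.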
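The variance remark can be checked numerically. The sketch below is only an illustration under assumed settings (continuous draws, d = 100, n = 10, 20,000 repetitions, a fixed seed); the function name doubled_mean_variance is hypothetical and not part of the original code.

```python
import numpy as np

def doubled_mean_variance(d=100.0, n=10, trials=20_000, seed=0):
    """Empirical variance of the sample mean and of 2 * sample mean.

    d, n, trials and seed are illustrative values, not from the original post.
    """
    rng = np.random.default_rng(seed)
    # Each row is one experiment of n continuous draws from [0, d].
    samples = rng.uniform(0.0, d, size=(trials, n))
    means = samples.mean(axis=1)
    return means.var(), (2.0 * means).var()

var_mean, var_doubled = doubled_mean_variance()
# Ratio of the two variances (doubling should multiply the variance by 4).
print(var_doubled / var_mean)
# Empirical variance next to the theoretical value d^2 / (3n) for continuous draws.
print(var_doubled, 100.0**2 / (3 * 10))
```

The theoretical value follows from Var(X) = d^2/12 for a single uniform draw, so Var(2 * mean) = 4 * d^2 / (12 n) = d^2 / (3 n).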
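The solution that both frequentist and Bayesian statisticians arrive at is to scale the maximum observation. Using your notation, where n is the sample size, d is the population maximum, and max is the largest observed value, the estimator max * (1 + 1/n) - 1 is the minimum variance unbiased estimator of d for sample sizes > 1. (The - 1 term comes from the discrete, serial-number form of the problem; for a continuous uniform distribution on [0, d], the corresponding unbiased estimator is max * (1 + 1/n).) Since econbernardo chose not to update their answer to include this, I'm adding it here.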
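As a rough sanity check of the scaled-max idea, the sketch below averages the estimator over many simulated samples in two settings: continuous draws from [0, d], and distinct serial numbers drawn without replacement from 1..d (the tank-counting setup). The helper names, the sample size n = 5, the repetition count, and the seed are all assumptions made for illustration.

```python
import numpy as np

def scaled_max_continuous(samples):
    """max * (1 + 1/n): unbiased for d when samples are continuous on [0, d]."""
    samples = np.asarray(samples, dtype=float)
    return samples.max() * (1.0 + 1.0 / samples.size)

def scaled_max_discrete(serials):
    """max * (1 + 1/n) - 1: the serial-number (German tank) form."""
    serials = np.asarray(serials)
    return serials.max() * (1.0 + 1.0 / serials.size) - 1.0

def average_estimates(d=100, n=5, trials=20_000, seed=0):
    """Average each estimator over many simulated samples (illustrative settings)."""
    rng = np.random.default_rng(seed)
    cont = rng.uniform(0.0, d, size=(trials, n))
    # Distinct serial numbers from 1..d: rank random keys, keep the first n.
    disc = rng.random((trials, d)).argsort(axis=1)[:, :n] + 1
    cont_avg = np.mean([scaled_max_continuous(row) for row in cont])
    disc_avg = np.mean([scaled_max_discrete(row) for row in disc])
    return cont_avg, disc_avg

cont_avg, disc_avg = average_estimates()
print(cont_avg, disc_avg)  # both averages should land close to d = 100
```

Note that the code in this answer truncates continuous draws to integers in 0..99, which matches neither idealized setting exactly, so a small residual offset in its plot would not be surprising.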
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def sample_random_normal(n = 100):
    # One sample for each size i = 1..99, with draws truncated to integers
    samples = [np.random.uniform(0, n, size = i).astype(int) for i in range(1,100)]
    # One row per sample size: [max, scaled max, 2*mean]
    return np.array([np.array([max(j), max(j)*(1 + 1/len(j)) - 1, 2*np.mean(j)]) for j in samples])

def repeat_experiment():
    # Repeat the whole sweep 100 times and average the estimates per sample size
    experiments = np.array([sample_random_normal() for _ in range(100)])
    return experiments.mean(axis = 0)

result = repeat_experiment()
df = pd.DataFrame(result)
df.columns = ['max_value', 'unbiased_max', '2*mean']
df['k'] = pd.Series(range(1,100))
df['actual_value'] = 100
df['max_value-actual-value'] = df['max_value'] - df['actual_value']
df['unbiased_max-actual-value'] = df['unbiased_max'] - df['actual_value']
df['2*mean-actual_value'] = df['2*mean'] - df['actual_value']
plt.plot(df['k'], df['max_value'], linestyle = 'dotted', label = 'max_value_estimate')
plt.plot(df['k'], df['unbiased_max'], linestyle = 'solid', label = 'unbiased_max_estimate')
plt.plot(df['k'], df['2*mean'], linestyle = 'dashed', label = '2*mean estimate')
plt.plot(df['k'], df['max_value-actual-value'], linestyle = 'dotted', label = 'max_value-actual-value')
plt.plot(df['k'], df['unbiased_max-actual-value'], linestyle = 'solid', label = 'unbiased_max-actual-value')
plt.plot(df['k'], df['2*mean-actual_value'], linestyle = 'dashed', label = '2*mean-actual_value')
plt.legend()
plt.show()
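A sample plot of the results is given below. The solid line is the unbiased MVUE based on the observed maximum; the other two lines come from the solutions you proposed.
As you can see, the scaled-max estimator dominates the other two: it does not have the bias of the plain max, and it has less variability than doubling the sample mean.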
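Because the plot averages over repetitions, it shows bias but not spread. A numeric summary along the following lines can make the comparison concrete. This is a sketch under assumptions: it uses continuous draws (so the scaled max omits the - 1 term), a single sample size n = 10, and a hypothetical helper named compare_estimators.

```python
import numpy as np

def compare_estimators(d=100.0, n=10, trials=20_000, seed=0):
    """Return {name: (bias, std)} for the three estimators of d.

    All parameter values are illustrative, not taken from the original post.
    """
    rng = np.random.default_rng(seed)
    # Each row is one experiment of n continuous draws from [0, d].
    samples = rng.uniform(0.0, d, size=(trials, n))
    sample_max = samples.max(axis=1)
    estimates = {
        "max": sample_max,
        "scaled_max": sample_max * (1.0 + 1.0 / n),
        "2*mean": 2.0 * samples.mean(axis=1),
    }
    return {name: (est.mean() - d, est.std()) for name, est in estimates.items()}

for name, (bias, std) in compare_estimators().items():
    print(f"{name:>10}: bias={bias:8.3f}  std={std:7.3f}")
```

Under these assumptions the plain max should show a clearly negative bias, while the scaled max and the doubled mean should both sit near zero bias, with the scaled max having the smaller standard deviation.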