What exactly does this random.uniform line in Python do?

Question

I'm following a tutorial by Andrew Cross on using random forests in Python. I got the code to run, and for the most part I understand the output. However, I'm not sure what exactly this line does:

df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75

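I know that it "creates a (random) uniform distribution between 0 and 1 and assigns 3/4ths of the data to be in the training subset." However, the training subset isn't always exactly 3/4 of the data. Sometimes it's smaller and sometimes it's larger. So is a random-sized subset of roughly 75% being chosen? Why not make it always 75%?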

Answer 1

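np.random.uniform(0, 1, len(df)) creates an array of len(df) random numbers, each drawn uniformly from the interval [0, 1).
<= .75 then creates another array of the same length, containing True where the number meets that condition and False elsewhere.
The code then uses the rows at the indices where True was found. Since a random distribution is... well, random, you won't get exactly 75% of the values.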
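
To make the two steps concrete, here is a minimal sketch on a toy array. The array `data`, its size of 8, and the fixed seed are assumptions for illustration only and are not part of the tutorial; a plain NumPy array stands in for the DataFrame.

```python
import numpy as np

np.random.seed(0)  # fixed seed only so the illustration is repeatable (assumption)

data = np.arange(8)  # hypothetical toy data standing in for the rows of df

draws = np.random.uniform(0, 1, len(data))  # one float in [0, 1) per row
mask = draws <= .75                         # True where the draw is at most 0.75

train = data[mask]    # elements where the mask is True
test = data[~mask]    # the remaining elements

print(mask)
print(mask.sum(), "of", len(data), "rows landed in the training subset")
```

Every row ends up in exactly one of the two subsets, but how many land in `train` depends on the random draws.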

Answer 2

It doesn't assign 3/4 of the data to the training subset.
It gives each row a 3/4 probability of being in the training subset.

Example:

>>> import numpy as np
>>> sum(np.random.uniform(0, 1, 10) < .75)
8
>>> sum(np.random.uniform(0, 1, 10) < .75)
10
>>> sum(np.random.uniform(0, 1, 10) < .75)
7

80% of the data is in the training subset in the first example,
100% in the second,
70% in the third.

On average, it should be 75%.
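
A rough way to check the "on average" claim is to repeat the split many times and look at the spread. The sketch below assumes a seeded generator and 10,000 trials, both arbitrary choices made here for illustration. Since each row is included independently with probability 0.75, the training-set size follows a binomial distribution with n = 10 and p = 0.75.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded generator (assumption, for repeatability)

n_rows = 10         # same sample size as the interactive example above
n_trials = 10_000   # number of simulated splits (arbitrary choice)

# Each row of `draws` is one simulated split of n_rows items.
draws = rng.uniform(0, 1, size=(n_trials, n_rows))
train_counts = (draws < .75).sum(axis=1)  # training-set size in each trial

fractions = train_counts / n_rows
print("smallest fraction:", fractions.min())
print("largest fraction: ", fractions.max())
print("mean fraction:    ", fractions.mean())  # close to 0.75
```

Individual splits vary a lot at this sample size, while the mean over many trials settles near 0.75. With a larger n_rows the relative spread shrinks, which is why the split looks closer to 75% on bigger datasets.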

Answer 3

If you want a stricter random selection in which the training set is always very close to 75% of the data, you can use code like this:

import numpy as np

d = np.random.uniform(0, 1, 1000)
p = np.percentile(d, 75)  # value below which 75% of the draws fall

print(np.sum(d <= p))   # 750
print(np.sum(d <= .75)) # e.g. 745; varies from run to run

Applied to your example:

d = np.random.uniform(0, 1, len(df))
p = np.percentile(d, 75)
df['is_train'] = d <= p
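
If the goal is a training set of exactly 75%, another option is to shuffle the row positions and mark the first 75% of them. This is a different technique from the percentile approach above, shown only as a sketch; `n = 1000` stands in for len(df) and is an assumption.

```python
import numpy as np

n = 1000                         # stands in for len(df) (assumption)
n_train = int(round(0.75 * n))   # exact training-set size

mask = np.zeros(n, dtype=bool)
train_idx = np.random.permutation(n)[:n_train]  # n_train distinct random positions
mask[train_idx] = True

print(mask.sum())  # always equals n_train
```

With a DataFrame, the mask could be assigned the same way as in the original line, for example df['is_train'] = mask.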

Tags: python, random, random-forest