Sample from groups proportional to score

Question

I have data with the following structure:

CREATE TABLE if not EXISTS scores (
  id int,
  class char,
  score float
);

INSERT INTO scores VALUES
(1, 'A', 0.5),
(1, 'B', 0.2),
(1, 'C', 0.1),
(2, 'A', 0.1),
(2, 'B', 0.2),
(3, 'D', 0.01),
(4, 'A', 0.5),
(4, 'B', 0.5);

I would like to randomly sample one class for each id. A possible sample would be:

1,'A'
2,'B'
3,'D'
4,'A'

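The logic I want to use for sampling is as follows: each class is sampled in proportion to its score. For example:

In id = 1, class 'B' should be twice as likely to be sampled as class 'C'.
In id = 2, class 'B' should be twice as likely to be sampled as class 'A'.
In id = 3, we should only ever sample class 'D'.
In id = 4, class 'B' should be as likely to be sampled as class 'A'.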

I am looking for a way to do this in BigQuery/PostgreSQL. Also, is there a solution with a fixed random seed, so that the result can be reproduced?

Thanks!

Answer 1

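A possible approach is to generate, for each id-class pair, a number of rows proportional to its score (50 "1-A" rows, 20 "1-B" rows, 10 "1-C" rows, and so on), then select 1 random row per id.

For BigQuery: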

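-- Expand each (id, class) into round(score * 100) rows, then keep one random row per id.
-- round() guards against floating-point results such as 28.999999999999996.
select id, array_agg(class order by rand() limit 1)[offset(0)] as class
from scores, unnest(generate_array(1, cast(round(score * 100) as int64)))
group by id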

Answer 2

In PostgreSQL I know of 2 ways.

First way, with DISTINCT ON and random():

SELECT DISTINCT ON (id) id, class
FROM scores 
ORDER BY id, random();

Second way, with a window function (OVER ... PARTITION BY) and random():

SELECT id, class 
FROM (
SELECT *, row_number() OVER (PARTITION BY id ORDER BY random()) as rn
FROM scores ) sub
WHERE rn = 1;

You can try both queries on DB<>FIDDLE.

Note that if you run the two queries together, they will return different records.
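
Both queries above give every class within an id the same chance and do not use score. The following is a minimal sketch of a weighted variant, not part of the original answer. It swaps in the exponential-key trick: each row draws a key of -ln(u) / score, and the smallest key per id wins, which selects a row with probability score / sum(score). It assumes every score is strictly positive. The seed value 0.42 is arbitrary. setseed() makes random() repeat within a session, but which row receives which random value still depends on the order in which rows are scanned, so treat reproducibility as best-effort.

```sql
-- Optional: fix the seed for this session so random() repeats.
-- The argument must be between -1 and 1; 0.42 is an arbitrary choice.
SELECT setseed(0.42);

-- Weighted pick: each row draws key = -ln(u) / score with u in (0, 1].
-- The row with the smallest key per id is kept.
-- Assumes score > 0 for every row (a zero score would divide by zero).
SELECT DISTINCT ON (id) id, class
FROM scores
ORDER BY id, -ln(1 - random()) / score;
```

Using 1 - random() instead of random() avoids taking ln(0), since random() can return 0 but never 1.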

Answer 3

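If I understand correctly, you have a column that is essentially a weight. You want to use it for random sampling, pulling one row per id, with the likelihood of each row based on its weight.

The idea is to do the following:

Normalize the weights into ranges between 0 and 1. You can do this with a cumulative sum and a division.
Pick one random number per id.
Compare the two.

The logic looks like this: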

select s.*
from (select s.*, 
             sum(score) over (partition by id order by class) / sum(score) over (partition by id) as threshold_hi,
             (sum(score) over (partition by id order by class) - score) / sum(score) over (partition by id) as threshold_lo
      from scores s
     ) s join
     (select i.id, random() as rand
      from (select distinct id from scores) i
     ) i
     on i.id = s.id and
        i.rand >= s.threshold_lo and i.rand < s.threshold_hi

Here is a db<>fiddle.
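
The query above uses random(), which is PostgreSQL syntax. As a hedged sketch of how the same cumulative-threshold idea might be ported to BigQuery with a repeatable draw, the version below replaces the random number with a hash of a seed string plus the id. The seed string 'seed-42-' and the names thresholds, draws, and u are hypothetical. Treating the hash as uniformly distributed is an assumption, and the table name may need a dataset qualifier in your project.

```sql
-- BigQuery sketch: cumulative thresholds per id, with a deterministic draw per id.
with thresholds as (
  select
    id,
    class,
    -- upper and lower bounds of each class's slice of [0, 1)
    sum(score) over (partition by id order by class)
      / sum(score) over (partition by id) as threshold_hi,
    (sum(score) over (partition by id order by class) - score)
      / sum(score) over (partition by id) as threshold_lo
  from scores
),
draws as (
  select
    id,
    -- hash of (seed, id) mapped to [0, 1); change 'seed-42-' to get another sample
    abs(mod(farm_fingerprint(concat('seed-42-', cast(id as string))), 1000000)) / 1000000 as u
  from (select distinct id from scores)
)
select t.id, t.class
from thresholds t
join draws d
  on d.id = t.id
 and d.u >= t.threshold_lo
 and d.u < t.threshold_hi
```

For id = 1 the slices come out at roughly [0, 0.625) for 'A', [0.625, 0.875) for 'B', and [0.875, 1) for 'C', so 'B' is twice as wide as 'C', as the question requires.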
