在大查询中进行分层抽样的有效方法

Question

我有一个名为 TableA 的 Google 大查询 table，它有大约 3M 条记录。有一个名为 DimA（维度 A）的列，它有 20 个值 - 1 到 20。DimA 的每个值的计数显示在在下面的总计列中汇总 table。我做了一些分析并确定了我应该从 DimA 的每个值中抽取多少随机样本，它显示在列 Sample 中。 DimA 的每个值抽取的样本百分比显示在列 DimA_value_perc 中。我知道如何使用 table 下面的代码通过蛮力进行采样。但是，随着 DimA 值数量的增加以及存在其他维度，此代码不可扩展。有没有更有效的分层抽样方法？谢谢

DimA	Total	Sample	DimA_value_perc
1	115,623	3,077	3%
2	108,203	3,943	4%
3	153,477	6,802	4%
4	232,252	12,426	5%
5	223,004	14,052	6%
6	242,386	17,589	7%
7	121,519	9,783	8%
8	371,342	34,026	9%
9	147,683	15,400	10%
10	281,101	32,775	12%
11	93,380	12,075	13%
12	181,293	25,675	14%
13	122,206	19,344	16%
14	140,559	25,141	18%
15	95,576	19,498	20%
16	94,319	21,969	23%
17	108,282	30,054	28%
18	94,920	33,228	35%
19	82,764	39,700	48%
20	28,417	23,442	82%
Grand Total	3,038,306	400,000

SELECT *
FROM tableA
where DimA = 1
order by rand()
limit  3077 

union all

SELECT *
FROM tableA
where DimA = 2
order by rand()
limit   3943 

etc

Answer 1

考虑以下方法

with samples as (
  select 1 DimA, 115623 Total, 3077 Sample, 3 DimA_value_perc union all
  select 2, 108203, 3943, 4 union all
  select 3, 153477, 6802, 4 union all
  select 4, 232252, 12426, 5 union all
  select 5, 223004, 14052, 6 union all
  select 6, 242386, 17589, 7 union all
  select 7, 121519, 9783, 8 union all
  select 8, 371342, 34026, 9 union all
  select 9, 147683, 15400, 10 union all
  select 10, 281101, 32775, 12 union all
  select 11, 93380, 12075, 13 union all
  select 12, 181293, 25675, 14 union all
  select 13, 122206, 19344, 16 union all
  select 14, 140559, 25141, 18 union all
  select 15, 95576, 19498, 20 union all
  select 16, 94319, 21969, 23 union all
  select 17, 108282, 30054, 28 union all
  select 18, 94920, 33228, 35 union all
  select 19, 82764, 39700, 48 union all
  select 20, 28417, 23442, 82 
)
select a.* except(pos) from (
  select *, row_number() over(partition by DimA order by rand()) pos
  from tableA
) a
join samples 
using(DimA)
where pos <= Sample

在大查询中进行分层抽样的有效方法

efficient way to do stratified sampling in big query

sql

google-bigquery