How to sample from different values in a column but only return records that are unique in another column?
I'm struggling with a sampling problem in Teradata.
The data looks like this:
ID Group Rank
1 dog 1
1 cat 1
1 lion 1
1 elephant 2
2 dog 1
2 cat 1
2 lion 1
2 elephant 1
3 dog 1
3 cat 2
3 lion 1
3 elephant 1
4 dog 2
4 cat 1
4 lion 1
4 elephant 1
...
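Ideally, I would like to return a fixed number of sampled rows for each value of Group, with each ID appearing only once in the result.
Below is my current query, but it returns duplicate IDs: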
SELECT ID, Group FROM Table
WHERE rank = 1
SAMPLE
WHEN group = 'dog' then 10
WHEN group = 'cat' then 10
WHEN group = 'elephant' then 5
WHEN group = 'lion' then 5
END
Assuming you have enough records, choose one random row for each id, and then take the appropriate number of rows per group from that:
select t.*
from (select t.*,
             -- number the surviving rows within each group
             row_number() over (partition by group order by seqnum) as seqnum_g
      from (select t.*,
                   -- random ordering of the rows for each id
                   row_number() over (partition by id order by random(1, 1000000)) as seqnum
            from t
           ) t
      where seqnum = 1  -- keep one random row per id
     ) t
where (group in ('dog', 'cat') and seqnum_g <= 10) or
      (group in ('elephant', 'lion') and seqnum_g <= 5) ;
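This does not guarantee that the groups in the result set are large enough. However, if you have enough data relative to the size of the groups, it should work.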
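The two-step logic described above (one random row per id, then a capped number of rows per group) can be sketched outside the database. The following is only an illustration in plain Python, not Teradata SQL; the function name `sample_unique_ids`, its parameters, and the demo data are hypothetical and not part of the original thread. It also shows why a group can come up short: once an id has been assigned to one group, it is no longer available to the others.

```python
import random
from collections import defaultdict


def sample_unique_ids(rows, quotas, seed=None):
    """Pick rows so that every id appears at most once.

    rows   -- iterable of (id, group) pairs, assumed already filtered to rank = 1
    quotas -- dict mapping group name to the maximum number of rows wanted
    seed   -- optional seed, only for reproducible demos (hypothetical parameter)
    """
    rng = random.Random(seed)

    # Step 1: collect the candidate groups of each id and keep one at random.
    groups_by_id = defaultdict(list)
    for row_id, group in rows:
        groups_by_id[row_id].append(group)
    one_per_id = [(row_id, rng.choice(groups))
                  for row_id, groups in groups_by_id.items()]

    # Step 2: walk the survivors in random order and keep at most
    # quotas[group] rows for each group.
    rng.shuffle(one_per_id)
    taken = defaultdict(int)
    result = []
    for row_id, group in one_per_id:
        if taken[group] < quotas.get(group, 0):
            taken[group] += 1
            result.append((row_id, group))
    return result


# Made-up demo data: 200 ids, each with a rank-1 row in every group.
group_names = ["dog", "cat", "elephant", "lion"]
rows = [(i, g) for i in range(1, 201) for g in group_names]
quotas = {"dog": 10, "cat": 10, "elephant": 5, "lion": 5}

picked = sample_unique_ids(rows, quotas, seed=42)
picked_ids = [row_id for row_id, _ in picked]
print(len(picked_ids) == len(set(picked_ids)))  # every id appears at most once
```

With only a handful of ids the same function returns fewer rows than the quotas ask for, which is the caveat mentioned above.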
with cte as
(
SELECT ID, Group,
random(1,10000) as rnd -- RANDOM can't be directly used in OLAP-functions
FROM Table
WHERE rank = 1
)
SELECT ID, Group
FROM cte
QUALIFY
ROW_NUMBER() -- get one random row per ID
OVER (PARTITION BY ID
ORDER BY rnd) = 1
SAMPLE
WHEN group = 'dog' then 10
WHEN group = 'cat' then 10
WHEN group = 'elephant' then 5
WHEN group = 'lion' then 5
END
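Whichever query is used, it may be worth checking an exported result set for the two properties the question asks for: no repeated ID, and no group above its requested size. Below is a minimal sketch of such a check in plain Python; `check_sample` and the sample rows are hypothetical names and data introduced only for illustration.

```python
from collections import Counter


def check_sample(result_rows, quotas):
    """Return (duplicate_ids, over_quota, shortfall) for a list of (id, group) rows."""
    id_counts = Counter(row_id for row_id, _ in result_rows)
    group_counts = Counter(group for _, group in result_rows)

    # ids that occur more than once in the result
    duplicate_ids = sorted(i for i, n in id_counts.items() if n > 1)
    # groups that received more rows than requested
    over_quota = {g: n for g, n in group_counts.items() if n > quotas.get(g, 0)}
    # groups that received fewer rows than requested, with the missing amount
    shortfall = {g: q - group_counts.get(g, 0)
                 for g, q in quotas.items() if group_counts.get(g, 0) < q}
    return duplicate_ids, over_quota, shortfall


quotas = {"dog": 10, "cat": 10, "elephant": 5, "lion": 5}
# Made-up result in which ID 1 was returned for two different groups.
bad_rows = [(1, "dog"), (1, "cat"), (2, "lion")]
duplicates, over, short = check_sample(bad_rows, quotas)
print(duplicates)  # [1]
```

A non-empty `shortfall` is expected when there are too few distinct IDs for the requested sample sizes.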