Random sample table with Hive, but including matching rows

Question

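I have a large table with a userID column and other columns of user variables, and I want to use Hive to draw a random sample of users based on their userID. Some users appear on multiple rows, so whenever a randomly selected userID also occurs elsewhere in the table, I want to extract those rows as well.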

I looked at the Hive sampling documentation and saw that something like this extracts a 1% sample:

SELECT * FROM source 
TABLESAMPLE (1 PERCENT) s;

But I'm not sure how to add a constraint so that every other row belonging to those sampled userIDs is selected too.

Answer 1

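You can use rand() to split the data randomly at the user level, assigning the appropriate percentage of user IDs to each category and then joining back to the table to pick up every row for the selected users. I recommend rand() because setting the seed to a fixed value can make the results repeatable.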

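-- Label each distinct user once, then return every row for users labelled 'test'.
select c.*
from
(select userID
, if(rand(5555) < 0.1, 'test', 'train') as sample_type  -- 0.1 = ~10% of users; use 0.01 for ~1%
    from
    (select userID
    from mytable
    group by userID  -- one row per user, so each user gets exactly one label
    ) a
) b
join
(select *
from mytable
) c
on b.userID = c.userID
where b.sample_type = 'test'
;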

This was set up for entity-level modeling, which is why I used test and train as the labels.
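
As a possible alternative to the join, the same "all rows for a sampled user" behavior can be sketched by filtering on a hash of userID, since every row for a given user hashes to the same bucket. This is a minimal sketch, not part of the original answer: it assumes the table is named mytable as above and that userID values spread fairly evenly across hash buckets. The selection is deterministic rather than random, so the sampled fraction is only approximately 1% of users.

```sql
-- Sketch under stated assumptions: mytable and userID are the names used above.
-- hash() can return negative integers, so pmod() is used to get a bucket in 0..99.
SELECT t.*
FROM mytable t
WHERE pmod(hash(t.userID), 100) = 0;  -- keep 1 of 100 buckets, roughly 1% of users
```

Changing the bucket number on the right-hand side (any value from 0 to 99) yields a different, non-overlapping set of users, and rerunning the query on the same data returns the same users.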

hive

random-sample

hiveql