How to select IDs that have at least two specific instances in a given column

Question

I'm working with a medical claims table in pyspark, and I only want to return the userids that have at least 2 claim_ids. My table looks like this:

claim_id |  userid |  diagnosis_type |  claim_type
__________________________________________________
1            1            C100            M
2            1            C100a           M
3            2            D50             F
5            3            G200            M
6            3            C100            M
7            4            C100a           M
8            4            D50             F
9            4            A25             F

In this example, I only want to return userids 1, 3, and 4. Currently I'm building a temp table that counts all the distinct claim_ids per userid:

create table temp.claim_count as
select distinct userid, count(distinct claim_id) as claims
from medical_claims
group by userid

and then pulling from this table the userids whose number of claim_ids is > 1:

select distinct userid
from medical_claims
where userid in (
    select distinct userid
    from temp.claim_count
    where claims>1)

Is there a better / more efficient way?

Answer 1

If you only need the IDs, use group by with a having clause:

select userid, count(*) as claims
from medical_claims
group by userid
having count(*) > 1;
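
As a quick sanity check, the query above can be run against the sample rows from the question. This is only a sketch: it assumes an in-memory SQLite database (Python's standard sqlite3 module) as a stand-in for Spark SQL, and the column types are guesses based on the sample table. An `order by` is added so the output order is deterministic.

```python
import sqlite3

# Sample rows from the question: (claim_id, userid, diagnosis_type, claim_type)
rows = [
    (1, 1, "C100", "M"),
    (2, 1, "C100a", "M"),
    (3, 2, "D50", "F"),
    (5, 3, "G200", "M"),
    (6, 3, "C100", "M"),
    (7, 4, "C100a", "M"),
    (8, 4, "D50", "F"),
    (9, 4, "A25", "F"),
]

# Assumption: SQLite stands in for Spark SQL; column types are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table medical_claims "
    "(claim_id integer, userid integer, diagnosis_type text, claim_type text)"
)
conn.executemany("insert into medical_claims values (?, ?, ?, ?)", rows)

# Same query as in the answer, plus an order by for stable output.
query = """
select userid, count(*) as claims
from medical_claims
group by userid
having count(*) > 1
order by userid
"""
claims_per_user = conn.execute(query).fetchall()
user_ids = [userid for userid, _ in claims_per_user]

print(claims_per_user)  # [(1, 2), (3, 2), (4, 3)]
print(user_ids)         # [1, 3, 4]
```

Note that the question counts distinct claim_ids. If a claim_id could appear on more than one row, `having count(distinct claim_id) > 1` would match that intent more closely than `count(*)`.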

If you want the original rows, use a window function:

select mc.*
from (select mc.*, count(*) over (partition by userid) as num_claims
      from medical_claims mc
     ) mc
where num_claims > 1;
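
The window-function version can be checked the same way. Again a sketch under assumptions: SQLite (via Python's sqlite3) stands in for Spark SQL, window functions need SQLite 3.25 or newer, and only claim_id and userid are selected to keep the output short.

```python
import sqlite3

# Sample rows from the question: (claim_id, userid, diagnosis_type, claim_type)
rows = [
    (1, 1, "C100", "M"), (2, 1, "C100a", "M"), (3, 2, "D50", "F"),
    (5, 3, "G200", "M"), (6, 3, "C100", "M"), (7, 4, "C100a", "M"),
    (8, 4, "D50", "F"), (9, 4, "A25", "F"),
]

# Assumption: SQLite (3.25+) stands in for Spark SQL; column types are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table medical_claims "
    "(claim_id integer, userid integer, diagnosis_type text, claim_type text)"
)
conn.executemany("insert into medical_claims values (?, ?, ?, ?)", rows)

# Count claims per userid on every row, then keep rows whose user has > 1 claim.
query = """
select mc.claim_id, mc.userid
from (select mc.*, count(*) over (partition by userid) as num_claims
      from medical_claims mc
     ) mc
where num_claims > 1
order by mc.claim_id
"""
kept_rows = conn.execute(query).fetchall()

# Claim 3 (userid 2, a single claim) is the only row filtered out.
print(kept_rows)
```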

sql

pyspark

apache-zeppelin