Hive: I have a QA dataset (ID, time, content, role) ordered by timestamp. How can I transpose it to a format like (ID, roleA, roleB)?

I want output like the following:

    ID      roleA                           roleB
    xxx     is customer service?            yes, how can i help you, how can i help you
    xxx     is customer service?            yes
    xxx     great, why this happens         wait a minute, let me check

I don't know how to solve this with SQL or Python.

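This is a gaps-and-islands problem combined with conditional aggregation. First, collapse each run of consecutive messages from the same role into a single row: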

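    select biz_id, send_role, min(create_time) as create_time,
           -- Hive's concat_ws takes the separator first, then the array
           concat_ws(' ', collect_list(content)) as content
    from (select t.*,
                 -- position within the whole conversation
                 row_number() over (partition by biz_id order by create_time) as seqnum,
                 -- position among this role's messages only
                 row_number() over (partition by biz_id, send_role order by create_time) as seqnum_2
          from t
         ) t
    -- seqnum - seqnum_2 stays constant within a run of consecutive same-role rows
    group by biz_id, send_role, (seqnum - seqnum_2);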
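
To see why the difference of the two row numbers identifies each run, here is a small Python sketch that recomputes the same grouping key by hand. The sample rows, the integer timestamps, and the helper name `island_keys` are made up for illustration; only the column names and the role codes 2 and 3 come from the query.

```python
from collections import defaultdict

# Hypothetical sample rows: (biz_id, create_time, content, send_role).
rows = [
    ("c1", 1, "is customer service?", 2),
    ("c1", 2, "yes", 3),
    ("c1", 3, "how can i help you", 3),
    ("c1", 4, "great", 2),
    ("c1", 5, "why this happens", 2),
    ("c1", 6, "wait a minute", 3),
]

def island_keys(rows):
    """Return (biz_id, send_role, seqnum - seqnum_2) for each row in time order."""
    seqnum = defaultdict(int)    # row_number() partitioned by biz_id
    seqnum_2 = defaultdict(int)  # row_number() partitioned by biz_id, send_role
    keys = []
    for biz_id, create_time, content, send_role in sorted(rows, key=lambda r: (r[0], r[1])):
        seqnum[biz_id] += 1
        seqnum_2[(biz_id, send_role)] += 1
        diff = seqnum[biz_id] - seqnum_2[(biz_id, send_role)]
        keys.append((biz_id, send_role, diff))
    return keys

keys = island_keys(rows)
for row, key in zip(rows, keys):
    print(key, row[2])
```

Rows that belong to the same run of consecutive messages from one role share a key, so grouping by that key collapses each run into one row.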

Then, using this, you can re-aggregate to get what you want:

    with x as (
          -- one row per run of consecutive messages from the same role
          select biz_id, send_role, min(create_time) as create_time,
                 concat_ws(' ', collect_list(content)) as content
          from (select t.*,
                       row_number() over (partition by biz_id order by create_time) as seqnum,
                       row_number() over (partition by biz_id, send_role order by create_time) as seqnum_2
                from t
               ) t
          group by biz_id, send_role, (seqnum - seqnum_2)
         )
    select biz_id,
           -- pair the n-th run of role 2 with the n-th run of role 3
           max(case when send_role = 2 then content end) as roleA,
           max(case when send_role = 3 then content end) as roleB
    from (select x.*,
                 row_number() over (partition by biz_id, send_role order by create_time) as seqnum
          from x
         ) x
    group by biz_id, seqnum;
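
Since the question also mentions Python, the same two steps can be sketched in plain Python. This is only an illustration under assumptions: the sample rows, the integer timestamps, and the function name `to_role_pairs` are hypothetical, and role codes 2 and 3 are taken from the query above. Python's sort is stable, so rows with equal timestamps keep their input order here, which the SQL version does not guarantee.

```python
from itertools import groupby

# Hypothetical sample rows: (biz_id, create_time, content, send_role).
rows = [
    ("c1", 1, "is customer service?", 2),
    ("c1", 2, "yes", 3),
    ("c1", 3, "how can i help you", 3),
    ("c1", 4, "great", 2),
    ("c1", 5, "why this happens", 2),
    ("c1", 6, "wait a minute", 3),
]

def to_role_pairs(rows, role_a=2, role_b=3):
    """Collapse consecutive same-role messages, then pair the n-th role_a
    turn with the n-th role_b turn of each conversation."""
    out = []
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))  # by biz_id, create_time
    for biz_id, conv in groupby(ordered, key=lambda r: r[0]):
        turns = {role_a: [], role_b: []}
        # groupby on send_role yields one group per run of consecutive rows
        for role, run in groupby(conv, key=lambda r: r[3]):
            if role in turns:
                turns[role].append(" ".join(r[2] for r in run))
        n = max(len(turns[role_a]), len(turns[role_b]))
        for i in range(n):
            a = turns[role_a][i] if i < len(turns[role_a]) else None
            b = turns[role_b][i] if i < len(turns[role_b]) else None
            out.append((biz_id, a, b))
    return out

pairs = to_role_pairs(rows)
for pair in pairs:
    print(pair)
```

A turn with no counterpart on the other side comes out as `None`, mirroring the NULL that the `max(case ...)` expressions produce.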

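Note: this might put the content in an arbitrary order on "adjacent" rows. Getting these in the "right" order is tricky. In your sample data, the date/times are the same, so there is no obvious ordering column.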