Hive: I have a QA dataset (ID, time, content, role) ordered by timestamp. How can I transpose it to a format like (ID, roleA, roleB)?
I want to output data like this:
ID  | roleA                   | roleB
xxx | is customer service?    | yes, how can i help you, how can i help you
xxx | is customer service?    | yes
xxx | great, why this happens | wait a minute, let me check
I don't know how to solve this with SQL or Python.
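This is a gaps-and-islands problem with conditional aggregation. Within each run of consecutive messages from the same role, the difference between the two row numbers below stays constant, so it can serve as the grouping key for that run: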
select biz_id, send_role, min(create_time) as create_time,
       concat_ws(' ', collect_list(content)) as content  -- Hive's concat_ws takes the separator first
from (select t.*,
             row_number() over (partition by biz_id order by create_time) as seqnum,
             row_number() over (partition by biz_id, send_role order by create_time) as seqnum_2
      from t
     ) t
group by biz_id, send_role, (seqnum - seqnum_2);
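To see why `seqnum - seqnum_2` identifies each run, here is a small Python sketch that computes the same key by hand. The sample rows and the function name `island_keys` are made up for illustration, and the rows are assumed to be already sorted by `create_time`.

```python
from collections import defaultdict

# Hypothetical rows: (biz_id, send_role, create_time, content), sorted by create_time.
rows = [
    ("xxx", 2, 1, "is customer service?"),
    ("xxx", 3, 2, "yes"),
    ("xxx", 3, 3, "how can i help you"),
    ("xxx", 2, 4, "great"),
    ("xxx", 2, 5, "why this happens"),
    ("xxx", 3, 6, "wait a minute"),
]

def island_keys(rows):
    """Return (biz_id, send_role, seqnum - seqnum_2) for every row."""
    per_id = defaultdict(int)       # counter equivalent to seqnum
    per_id_role = defaultdict(int)  # counter equivalent to seqnum_2
    keys = []
    for biz_id, role, _create_time, _content in rows:
        per_id[biz_id] += 1
        per_id_role[(biz_id, role)] += 1
        diff = per_id[biz_id] - per_id_role[(biz_id, role)]
        keys.append((biz_id, role, diff))
    return keys

for row, key in zip(rows, island_keys(rows)):
    print(key, row[3])
```

Rows that belong to the same run of consecutive messages from one role end up with the same key, which is what the `group by` in the query relies on.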
Then, using that result, you can re-aggregate to get what you want:
with x as (
      select biz_id, send_role, min(create_time) as create_time,
             concat_ws(' ', collect_list(content)) as content  -- separator first
      from (select t.*,
                   row_number() over (partition by biz_id order by create_time) as seqnum,
                   row_number() over (partition by biz_id, send_role order by create_time) as seqnum_2
            from t
           ) t
      group by biz_id, send_role, (seqnum - seqnum_2)
     )
select biz_id,
       max(case when send_role = 2 then content end) as roleA,
       max(case when send_role = 3 then content end) as roleB
from (select x.*,
             -- number each role's runs so the n-th run of one role pairs with the n-th run of the other
             row_number() over (partition by biz_id, send_role order by create_time) as seqnum
      from x
     ) x
group by biz_id, seqnum;
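Note: this can put content on "adjacent" rows in an arbitrary order. Getting these into the "correct" order is tricky. In your sample data the date/times are identical, so there is no obvious column to order by.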
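Since the question also mentions Python, here is a rough equivalent as a sketch, not a drop-in solution. It assumes the rows have been fetched into a list that is already in conversation order and grouped by ID (which sidesteps the tie problem only if that order is trustworthy), and that roles 2 and 3 map to roleA and roleB as in the query above. The function name `pivot_turns` and the sample rows are hypothetical.

```python
from itertools import groupby

def pivot_turns(rows, role_a=2, role_b=3, sep=" "):
    """rows: (biz_id, send_role, content) tuples, grouped by biz_id and
    already in conversation order. Returns (biz_id, roleA, roleB) tuples."""
    out = []
    for biz_id, id_rows in groupby(rows, key=lambda r: r[0]):
        # Collapse each run of consecutive messages from the same role.
        runs = [
            (role, sep.join(r[2] for r in run))
            for role, run in groupby(id_rows, key=lambda r: r[1])
        ]
        a_runs = [text for role, text in runs if role == role_a]
        b_runs = [text for role, text in runs if role == role_b]
        # Pair the n-th run of role A with the n-th run of role B.
        for i in range(max(len(a_runs), len(b_runs))):
            a = a_runs[i] if i < len(a_runs) else None
            b = b_runs[i] if i < len(b_runs) else None
            out.append((biz_id, a, b))
    return out

# Hypothetical sample data.
sample = [
    ("xxx", 2, "is customer service?"),
    ("xxx", 3, "yes"),
    ("xxx", 3, "how can i help you"),
    ("xxx", 2, "great"),
    ("xxx", 2, "why this happens"),
    ("xxx", 3, "wait a minute"),
    ("xxx", 3, "let me check"),
]

for line in pivot_turns(sample):
    print(line)
```

Like the SQL, this pairs runs by position, so a conversation in which one role has more runs than the other produces rows with an empty side.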