Combining/collapsing rows with group_by + conditional on column for next row
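I am looking at chat logs from new customers of a telecom company. In each chat, a customer talks with a company representative.
I am trying to collapse the chat so that it has fewer rows. The images below show what the data looks like before and what it should look like after.
Before
After
I have looked at the following posts: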
- SQL - How to combine rows based on unique values
- Optimal way to concatenate/aggregate strings
- How to sort the result from string_agg()
I tried this code:
select
unique_id, string_agg(concat(text, ' ', text), ', ')
from
conversation
group by
unique_id, user
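However, this does not collapse the rows only where needed. It collapses everything into 2 rows, one for the customer and one for the company. The logic I am looking for is: if the next row in this query has the same unique_id and user, concatenate the text field of the current row with the text field of the next row.
Here is the SQL Fiddle page, though I am running this code in SQL Server, which has string_agg: http://sqlfiddle.com/#!9/5ad86c/3
If you look at my Stack Overflow history, I have already asked for an almost identical algorithm in R.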
CREATE TABLE conversation
(
`unique_id` double,
`line_no` int,
`user` varchar(7000),
`text` varchar(7000)
);
INSERT INTO conversation (`unique_id`, `line_no`, `user`, `text`)
VALUES
(50314585222, 1, 'customer', 'Hi I would like to sign up for a service'),
(50314585222, 2, 'company', 'Hi My name is Alex. We can offer the following plans. We also have signup bonuses, with doubling of data for 12 months '),
(50314585222, 3, 'company', 'Plan1: 40GB data, with monthly price of '),
(50314585222, 4, 'company', 'Plan2: 20GB data, with monthly price of '),
(50314585222, 5, 'company', 'Plan3: 5GB data, with monthly price of '),
(50314585222, 6, 'customer', 'I was hoping for a much smaller plan, with only voice service'),
(50314585222, 7, 'customer', 'maybe the per month plan.'),
(50319875222, 4, 'customer', 'so how do I sign up'),
(50319875222, 5, 'customer', '*for the service'),
(50319875222, 7, 'company', 'maybe I can call you for your details?')
;
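If I understand you correctly, the following approach is a possible solution. You need to find the rows where the speaker changes and define the appropriate groups:
Table: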
CREATE TABLE [conversation]
(
[unique_id] bigint,
[line_no] int,
[user] varchar(7000),
[text] varchar(7000)
);
INSERT INTO [conversation] ([unique_id], [line_no], [user], [text])
VALUES
(50314585222, 1, 'customer', 'Hi I would like to sign up for a service'),
(50314585222, 2, 'company', 'Hi My name is Alex. We can offer the following plans. We also have signup bonuses, with doubling of data for 12 months '),
(50314585222, 3, 'company', 'Plan1: 40GB data, with monthly price of '),
(50314585222, 4, 'company', 'Plan2: 20GB data, with monthly price of '),
(50314585222, 5, 'company', 'Plan3: 5GB data, with monthly price of '),
(50314585222, 6, 'customer', 'I was hoping for a much smaller plan, with only voice service'),
(50314585222, 7, 'customer', 'maybe the per month plan.'),
(50319875222, 4, 'customer', 'so how do I sign up'),
(50319875222, 5, 'customer', '*for the service'),
(50319875222, 7, 'company', 'maybe I can call you for your details?')
;
Statement:
; WITH ChangesCTE AS (
    -- Fetch the previous speaker within each conversation
    SELECT
        *,
        LAG([user]) OVER (PARTITION BY [unique_id] ORDER BY [line_no]) AS prev_user
    FROM [conversation]
), GroupsCTE AS (
    -- Running count of speaker changes: rows in the same run share a group_id
    SELECT
        *,
        SUM(CASE WHEN [user] <> [prev_user] OR [prev_user] IS NULL THEN 1 ELSE 0 END) OVER (PARTITION BY [unique_id] ORDER BY [line_no]) AS [group_id]
    FROM ChangesCTE
)
SELECT
    [unique_id],
    MIN([line_no]) AS [line_no],
    MIN([user]) AS [user],
    STRING_AGG([text], ' ') WITHIN GROUP (ORDER BY [line_no]) AS [text]
FROM GroupsCTE
GROUP BY [unique_id], [group_id]
-- Sort by the first line of each group so the runs come back in conversation order
ORDER BY [unique_id], MIN([line_no])
Result:
unique_id line_no user text
50314585222 1 customer Hi I would like to sign up for a service
50314585222 2 company Hi My name is Alex. We can offer the following plans. We also have signup bonuses, with doubling of data for 12 months Plan1: 40GB data, with monthly price of Plan2: 20GB data, with monthly price of Plan3: 5GB data, with monthly price of
50314585222 6 customer I was hoping for a much smaller plan, with only voice service maybe the per month plan.
50319875222 4 customer so how do I sign up *for the service
50319875222 7 company maybe I can call you for your details?
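To make the grouping step easier to follow, here is a minimal sketch that exposes the intermediate prev_user and group_id values for the sample rows. It is an illustration under assumptions, not part of the answer above: it runs the same LAG and running-SUM logic on an in-memory SQLite database through Python's sqlite3 module (SQLite 3.25 or later is assumed for window functions) instead of SQL Server, it leaves out the text column, and the helper name group_ids is made up for this example.

```python
import sqlite3

# Sample rows from the question, reduced to (unique_id, line_no, user).
ROWS = [
    (50314585222, 1, "customer"),
    (50314585222, 2, "company"),
    (50314585222, 3, "company"),
    (50314585222, 4, "company"),
    (50314585222, 5, "company"),
    (50314585222, 6, "customer"),
    (50314585222, 7, "customer"),
    (50319875222, 4, "customer"),
    (50319875222, 5, "customer"),
    (50319875222, 7, "company"),
]


def group_ids(rows):
    """Return (unique_id, line_no, user, prev_user, group_id) for every row."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        'CREATE TABLE conversation (unique_id INTEGER, line_no INTEGER, "user" TEXT)'
    )
    conn.executemany("INSERT INTO conversation VALUES (?, ?, ?)", rows)
    query = """
        WITH changes AS (
            SELECT *,
                   LAG("user") OVER (PARTITION BY unique_id ORDER BY line_no) AS prev_user
            FROM conversation
        )
        SELECT unique_id, line_no, "user", prev_user,
               -- Add 1 whenever the speaker differs from the previous row
               SUM(CASE WHEN prev_user IS NULL OR "user" <> prev_user THEN 1 ELSE 0 END)
                   OVER (PARTITION BY unique_id ORDER BY line_no) AS group_id
        FROM changes
        ORDER BY unique_id, line_no
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in group_ids(ROWS):
        print(row)
```

Each run of consecutive rows by the same speaker ends up with one group_id per conversation, which is why grouping by unique_id and group_id collapses exactly those runs.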
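This is a gaps-and-islands problem, where you want to group together adjacent rows from the same speaker.
To solve it, you need a column that orders the records. It seems we cannot use line_no, which has duplicate values within the same conversation. I will still assume that such a column exists, and call it ordering_id.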
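select
    unique_id,
    min(line_no) line_no,
    [user],
    -- string_agg requires a separator argument in SQL Server
    string_agg([text], ' ') within group(order by ordering_id) [text]
from (
    select
        t.*,
        row_number() over(partition by unique_id order by ordering_id) rn1,
        row_number() over(partition by unique_id, [user] order by ordering_id) rn2
    from mytable t
) t
-- rn1 - rn2 stays constant within each run of consecutive rows by the same user;
-- [user] is bracketed because USER is a reserved keyword in SQL Server
group by unique_id, [user], rn1 - rn2
order by unique_id, min(ordering_id)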
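The row-number difference can look opaque, so the following sketch prints rn1, rn2 and rn1 - rn2 for a single conversation. It is a hedged illustration only: it uses SQLite through Python's sqlite3 module (SQLite 3.25 or later assumed) rather than SQL Server, it drops the unique_id partition because there is just one conversation, and the table name chat, the helper name island_keys and the ordering_id values are hypothetical.

```python
import sqlite3

# One conversation, reduced to (ordering_id, user). ordering_id is the assumed
# ordering column; its values here are made up for the illustration.
ROWS = [
    (1, "customer"),
    (2, "company"),
    (3, "company"),
    (4, "company"),
    (5, "company"),
    (6, "customer"),
    (7, "customer"),
]


def island_keys(rows):
    """Return (ordering_id, user, rn1, rn2, rn1 - rn2) for every row."""
    conn = sqlite3.connect(":memory:")
    conn.execute('CREATE TABLE chat (ordering_id INTEGER, "user" TEXT)')
    conn.executemany("INSERT INTO chat VALUES (?, ?)", rows)
    query = """
        SELECT ordering_id, "user",
               ROW_NUMBER() OVER (ORDER BY ordering_id) AS rn1,
               ROW_NUMBER() OVER (PARTITION BY "user" ORDER BY ordering_id) AS rn2
        FROM chat
        ORDER BY ordering_id
    """
    result = [
        (oid, user, rn1, rn2, rn1 - rn2)
        for oid, user, rn1, rn2 in conn.execute(query)
    ]
    conn.close()
    return result


if __name__ == "__main__":
    for row in island_keys(ROWS):
        print(row)
```

Within a run of consecutive rows by the same speaker, rn1 and rn2 both increase by one per row, so their difference does not change. The difference alone can coincide between different speakers, which is why the query groups by the user column as well as by rn1 - rn2.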