Hive - Create map column types by aggregating values across groups
I have a table that looks like this:
|customer|category|room|date|
-----------------------------
|1 | A | aa | d1 |
|1 | A | bb | d2 |
|1 | B | cc | d3 |
|1 | C | aa | d1 |
|1 | C | bb | d2 |
|2 | A | aa | d3 |
|2 | A | bb | d4 |
|2 | C | bb | d4 |
|2 | C | ee | d5 |
|3 | D | ee | d6 |
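I want to create two maps from this table.
1st. map_customer_room_date: group by customer and collect all distinct rooms (key) and dates (value). I am using the collect() UDF from Brickhouse.
This can be achieved with something like: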
select customer, collect(room, `date`) as map_customer_room_date
from `table`
group by customer
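2nd. map_category_room_date is a bit more complex. It has the same map type, collect(room, date), but its keys are all rooms, across all customers, in every category in which customer X appears.
This means that for customer 1 it will include room ee, even though that room belongs to customer 2. This is because customer 1 has category C, and that category also exists for customer 2.
The final table, grouped by customer, should look like: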
|customer| map_customer_room_date | map_category_room_date |
-------------------------------------------------------------------|
| 1 |{aa: d1, bb: d2, cc: d3} |{aa: d1, bb: d2, cc: d3,ee: d5}|
| 2 |{aa: d3, bb: d4, ee: d5} |{aa: d3, bb: d4, ee: d5} |
| 3 |{ee: d6} |{ee: d6} |
I'm having trouble building the second map and presenting the final table as described.
Any idea how this can be achieved?
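This can be done by using a series of self-joins to find the other rooms in the same category, and then combining the results into the two maps.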
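Note that collect is not a Hive built-in; it comes from the Brickhouse library and has to be registered in the session before the query below will run. A minimal setup sketch, assuming the Brickhouse jar is available to your cluster (the path shown is a placeholder, not a real location) and that the class name matches the Brickhouse release you use:

```sql
-- Placeholder path: point this at wherever the Brickhouse jar actually lives.
ADD JAR /path/to/brickhouse.jar;

-- Register the Brickhouse UDAF under the name used in the queries here.
-- Called with two arguments, collect(key, value) aggregates rows into a map.
CREATE TEMPORARY FUNCTION collect AS 'brickhouse.udf.collect.CollectUDAF';
```

A temporary function only lives for the current session, so these statements would need to be repeated (or moved into an init script) for each new session.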
Code
CREATE TABLE `table` AS
SELECT 1 AS customer, 'A' AS category, 'aa' AS room, 'd1' AS `date` UNION ALL
SELECT 1 AS customer, 'A' AS category, 'bb' AS room, 'd2' AS `date` UNION ALL
SELECT 1 AS customer, 'B' AS category, 'cc' AS room, 'd3' AS `date` UNION ALL
SELECT 1 AS customer, 'C' AS category, 'aa' AS room, 'd1' AS `date` UNION ALL
SELECT 1 AS customer, 'C' AS category, 'bb' AS room, 'd2' AS `date` UNION ALL
SELECT 2 AS customer, 'A' AS category, 'aa' AS room, 'd3' AS `date` UNION ALL
SELECT 2 AS customer, 'A' AS category, 'bb' AS room, 'd4' AS `date` UNION ALL
SELECT 2 AS customer, 'C' AS category, 'bb' AS room, 'd4' AS `date` UNION ALL
SELECT 2 AS customer, 'C' AS category, 'ee' AS room, 'd5' AS `date` UNION ALL
SELECT 3 AS customer, 'D' AS category, 'ee' AS room, 'd6' AS `date`
;
SELECT
    customer_rooms.customer,
    collect(customer_rooms.room, customer_rooms.date) AS map_customer_room_date,
    collect(
        COALESCE(customer_category_rooms.room, category_rooms.room),
        COALESCE(customer_category_rooms.date, category_rooms.date)) AS map_category_room_date
FROM `table` AS customer_rooms
JOIN `table` AS category_rooms ON customer_rooms.category = category_rooms.category
LEFT OUTER JOIN `table` AS customer_category_rooms ON customer_rooms.customer = customer_category_rooms.customer
    AND category_rooms.category = customer_category_rooms.category
    AND category_rooms.room = customer_category_rooms.room
WHERE (
    customer_rooms.customer = customer_category_rooms.customer AND
    customer_rooms.category = customer_category_rooms.category AND
    customer_rooms.room = customer_category_rooms.room AND
    customer_rooms.date = customer_category_rooms.date
)
OR (
    customer_category_rooms.customer IS NULL AND
    customer_category_rooms.category IS NULL AND
    customer_category_rooms.room IS NULL AND
    customer_category_rooms.date IS NULL
)
GROUP BY
    customer_rooms.customer
;
Result set
1 {"aa":"d1","bb":"d2","cc":"d3"} {"aa":"d1","bb":"d2","cc":"d3","ee":"d5"}
2 {"aa":"d3","bb":"d4","ee":"d5"} {"aa":"d3","bb":"d4","ee":"d5"}
3 {"ee":"d6"} {"ee":"d6"}
Explanation
FROM `table` AS customer_rooms
First, the result is drawn from the original table. We name this relation customer_rooms. As you already noted in the question, this alone is enough to build map_customer_room_date.
JOIN `table` AS category_rooms ON customer_rooms.category = category_rooms.category
The first self-join identifies all rooms that share a category with the rooms explicitly listed in the customer_rooms rows. We name this relation category_rooms.
LEFT OUTER JOIN `table` AS customer_category_rooms ON customer_rooms.customer = customer_category_rooms.customer
    AND category_rooms.category = customer_category_rooms.category
    AND category_rooms.room = customer_category_rooms.room
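The second self-join takes the rooms we identified in category_rooms and checks whether each one is already held by the customer identified in customer_rooms. We name this relation customer_category_rooms. It is a LEFT OUTER JOIN because we want to keep every row from the previous join. The result is either 1) the values from customer_rooms and customer_category_rooms are identical, because the customer already holds this room, or 2) the values from customer_category_rooms are all NULL, because the customer does not hold this room but it is a room in the same category. This distinction becomes important so that we can keep the customer's own date when they already hold the room.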
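To see the two cases side by side, it can help to run the joins without the filter or the aggregation. The following is a diagnostic sketch only: the output column names (held_room, category_room, category_date, matched_room) are made up for illustration, and `date` is back-quoted on the assumption that your Hive version treats it as a reserved word.

```sql
SELECT
    customer_rooms.customer,
    customer_rooms.category,
    customer_rooms.room          AS held_room,
    category_rooms.room          AS category_room,
    category_rooms.`date`        AS category_date,
    -- NULL here means case 2: the customer does not hold
    -- category_room within this category.
    customer_category_rooms.room AS matched_room
FROM `table` AS customer_rooms
JOIN `table` AS category_rooms
    ON customer_rooms.category = category_rooms.category
LEFT OUTER JOIN `table` AS customer_category_rooms
    ON  customer_rooms.customer = customer_category_rooms.customer
    AND category_rooms.category = customer_category_rooms.category
    AND category_rooms.room     = customer_category_rooms.room
WHERE customer_rooms.customer = 1
;
```

For customer 1 on the sample data, the rows that pair category C with room ee come back with matched_room set to NULL, since ee is held only by customer 2 in that category.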
Next, we need to filter.
WHERE (
    customer_rooms.customer = customer_category_rooms.customer AND
    customer_rooms.category = customer_category_rooms.category AND
    customer_rooms.room = customer_category_rooms.room AND
    customer_rooms.date = customer_category_rooms.date
)
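This keeps the rooms that the customer explicitly holds in the original table. Rows where the second join matched a different room held by the same customer are dropped, since that room is already covered by its own customer_rooms row.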
OR (
    customer_category_rooms.customer IS NULL AND
    customer_category_rooms.category IS NULL AND
    customer_category_rooms.room IS NULL AND
    customer_category_rooms.date IS NULL
)
This keeps the rooms that are not held by the customer but belong to the same category as a room the customer holds.
collect(customer_rooms.room, customer_rooms.date) AS map_customer_room_date,
map_customer_room_date can be built by collecting the original data from the table, which we aliased as customer_rooms.
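If Brickhouse is not available, a similar map can usually be assembled from Hive built-ins. This is an alternative sketch, not what the query above does: it yields a map of strings, and it assumes that room and date values are strings that never contain ',' or ':'.

```sql
SELECT
    customer,
    str_to_map(
        -- Build "room:date" pairs, de-duplicate them, and join with commas.
        concat_ws(',', collect_set(concat(room, ':', `date`))),
        ',',   -- delimiter between key/value pairs
        ':'    -- delimiter between a key and its value
    ) AS map_customer_room_date
FROM `table`
GROUP BY customer
;
```

The delimiters are passed explicitly so the result does not depend on the defaults of str_to_map.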
collect(
    COALESCE(customer_category_rooms.room, category_rooms.room),
    COALESCE(customer_category_rooms.date, category_rooms.date)) AS map_category_room_date
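Building map_category_room_date is more complex. If the customer explicitly holds the room, we want to keep that date. If the customer does not explicitly hold the room, we want to use the room and date from another row with an overlapping category. To do this, we use the Hive COALESCE function, which returns the first value that is not NULL. If the customer already holds the room (as shown by non-NULL values in customer_category_rooms), we use those values. If not, we use the values from category_rooms.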
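As a quick illustration of how COALESCE behaves in the two cases, here it is with literal values standing in for the joined columns. The aliases held_date and borrowed_date are illustrative only, and SELECT without FROM assumes Hive 0.13 or later.

```sql
SELECT
    -- Case 1: the customer holds the room, so the first argument wins.
    COALESCE('d1', 'd3')                 AS held_date,
    -- Case 2: the outer join produced NULL, so the category_rooms value is used.
    COALESCE(CAST(NULL AS STRING), 'd5') AS borrowed_date
;
```

This returns d1 for held_date and d5 for borrowed_date.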
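Note that some ambiguity remains if the same category/room combination can map to multiple date values. If that matters, you may need to put in more work to pick the right date according to some business rule (for example, use the soonest date), or to map to multiple date values instead of a single value. If there are additional requirements like that, this should be a good starting point.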
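One way to make the choice deterministic is to rank the candidate dates per customer and room before collecting. The sketch below is an alternative formulation rather than a patch to the query above. The CTE names (customer_categories, candidates, ranked), the priority column, and the rule itself (prefer the customer's own row, then the smallest date) are assumptions made for illustration. It needs a Hive version that supports CTEs and window functions, and it compares dates as plain strings.

```sql
WITH customer_categories AS (
    -- One row per category a customer appears in.
    SELECT DISTINCT customer, category
    FROM `table`
),
candidates AS (
    -- Every room in those categories, from any customer.
    SELECT
        cc.customer,
        t.room,
        t.`date`,
        -- 0 = the customer's own row, 1 = borrowed from another customer.
        CASE WHEN t.customer = cc.customer THEN 0 ELSE 1 END AS priority
    FROM customer_categories cc
    JOIN `table` t
        ON cc.category = t.category
),
ranked AS (
    SELECT
        customer,
        room,
        `date`,
        ROW_NUMBER() OVER (
            PARTITION BY customer, room
            ORDER BY priority, `date`
        ) AS rn
    FROM candidates
)
SELECT
    customer,
    collect(room, `date`) AS map_category_room_date
FROM ranked
WHERE rn = 1          -- exactly one date per customer and room
GROUP BY customer
;
```

On the sample data this should give the same map_category_room_date values as the result set above, and the output can be joined back to the map_customer_room_date query on customer.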