Windowing function avg in Hive with over (order by colName)
I am trying to understand how the windowing function avg works, but it does not seem to behave the way I expect.
Here is the dataset:
select * from winsales;
+-------------------+------------------+--------------------+-------------------+---------------+-----------------------+--+
| winsales.salesid | winsales.dateid | winsales.sellerid | winsales.buyerid | winsales.qty | winsales.qty_shipped |
+-------------------+------------------+--------------------+-------------------+---------------+-----------------------+--+
| 30001 | NULL | 3 | b | 10 | 10 |
| 10001 | NULL | 1 | c | 10 | 10 |
| 10005 | NULL | 1 | a | 30 | NULL |
| 40001 | NULL | 4 | a | 40 | NULL |
| 20001 | NULL | 2 | b | 20 | 20 |
| 40005 | NULL | 4 | a | 10 | 10 |
| 20002 | NULL | 2 | c | 20 | 20 |
| 30003 | NULL | 3 | b | 15 | NULL |
| 30004 | NULL | 3 | b | 20 | NULL |
| 30007 | NULL | 3 | c | 30 | NULL |
| 30001 | NULL | 3 | b | 10 | 10 |
+-------------------+------------------+--------------------+-------------------+---------------+-----------------------+--+
When I run the following query:
select salesid, sellerid, qty, avg(qty) over (order by sellerid) as avg_qty from winsales order by sellerid,salesid;
I get the following result:
+----------+-----------+------+---------------------+--+
| salesid | sellerid | qty | avg_qty |
+----------+-----------+------+---------------------+--+
| 10001 | 1 | 10 | 20.0 |
| 10005 | 1 | 30 | 20.0 |
| 20001 | 2 | 20 | 20.0 |
| 20002 | 2 | 20 | 20.0 |
| 30001 | 3 | 10 | 18.333333333333332 |
| 30001 | 3 | 10 | 18.333333333333332 |
| 30003 | 3 | 15 | 18.333333333333332 |
| 30004 | 3 | 20 | 18.333333333333332 |
| 30007 | 3 | 30 | 18.333333333333332 |
| 40001 | 4 | 40 | 19.545454545454547 |
| 40005 | 4 | 10 | 19.545454545454547 |
+----------+-----------+------+---------------------+--+
The question is: how is avg(qty) being calculated?
Since I am not using partition by, I expected avg(qty) to be the same for all rows.
Any ideas?
If you want the same avg(qty) on every row, remove order by sellerid from the over clause; every row will then have the value 19.545454545454547.
Query to get the same avg(qty) for all rows:
hive> select salesid, sellerid, qty, avg(qty) over () as avg_qty from winsales order by sellerid,salesid;
If we include order by sellerid in the over clause, you get a cumulative average up to and including each sellerid.
That is:
sellerid 1: 2 records (qty 10, 30), 2 records in total so far, so the average is
(10+30)/2 = 20.0
sellerid 2: 2 records (qty 20, 20), 4 records in total so far, so the average is
(10+30+20+20)/4 = 20.0
sellerid 3: 5 records (qty 10, 10, 15, 20, 30), 9 records in total so far, so the average is
(10+30+20+20+10+10+15+20+30)/9 = 18.333
sellerid 4: 2 records (qty 40, 10), 11 records in total, so the average is
(10+30+20+20+10+10+15+20+30+40+10)/11 = 19.545454545454547
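This is the expected behaviour in Hive when the over clause contains an order by. With an order by and no explicit window frame, the frame defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so each row is averaged together with every row whose sellerid is less than or equal to its own, and rows sharing a sellerid get the same value. With neither order by nor a frame, the window covers all rows, which is why over () returns one overall average.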
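To double-check the arithmetic, here is a minimal Python sketch that recomputes the cumulative average per sellerid from the (sellerid, qty) pairs in the question. It only mimics the "all rows up to and including the current sellerid" rule and is not a description of how Hive evaluates the query internally; the names `rows` and `running_avg_by_key` are made up for this illustration.

```python
from itertools import groupby

# (sellerid, qty) pairs copied from the winsales table in the question.
rows = [(3, 10), (1, 10), (1, 30), (4, 40), (2, 20), (4, 10),
        (2, 20), (3, 15), (3, 20), (3, 30), (3, 10)]


def running_avg_by_key(pairs):
    """Return {sellerid: average qty over all rows with sellerid <= this one}.

    Rows with the same sellerid are treated as peers: the whole group is
    added to the running totals before the average is recorded.
    """
    result = {}
    total = 0
    count = 0
    for key, group in groupby(sorted(pairs), key=lambda r: r[0]):
        for _, qty in group:
            total += qty
            count += 1
        result[key] = total / count
    return result


averages = running_avg_by_key(rows)
overall = sum(qty for _, qty in rows) / len(rows)

for sellerid, avg_qty in averages.items():
    print(sellerid, avg_qty)
print("overall", overall)
```

The value for the last sellerid equals the overall average, which matches what avg(qty) over () returns for every row.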