Is there a window function in Presto to compute the average spending per merchant over the last 90 days?
I want to compute the average spending per merchant over the last 90 days.
I have been using PySpark SQL:
df_spark = df_spark.withColumn("t_unix", F.unix_timestamp(df_spark['date']))  # epoch seconds
# frame: 90 days (in seconds) back, ending at -1 so the current row is excluded
windowSpec = Window.orderBy("t_unix").partitionBy("merchant").rangeBetween(-90 * 24 * 3600, -1)
average_spending = F.avg(df_spark['amount']).over(windowSpec)
df = df_spark.withColumn("average_spending", average_spending)
df.select('merchant', 'date', 'amount', 'average_spending').show(5)
+---------+-------------------+-------+----------------+
| merchant|date |amount |average_spending|
+---------+-------------------+-------+----------------+
| 26 |2017-01-01 01:11:06| 3 | null|
| 26 |2017-01-01 02:02:15| 54 | 3.0|
| 26 |2017-01-01 02:26:45| 6 | 28.5|
| 26 |2017-01-01 02:40:37| 4 | 21.0|
| 26 |2017-01-01 02:41:51| 85 | 16.75|
+---------+-------------------+-------+----------------+
only showing top 5 rows
Now I want to do the same in AWS Athena (Presto).
I tried the following query:
SELECT
"date",
"merchant",
"amount",
AVG("amount")
FROM "table"
WHERE ("date" BETWEEN date_add('day', -90, "date") and "date")
GROUP BY "merchant"
ORDER BY "date"
LIMIT 5
But I get the error message:
Your query has the following error(s):
SYNTAX_ERROR: line 7:24: Unexpected parameters (varchar(3), integer, varchar) for function date_add. Expected: date_add(varchar(x), bigint, date) , date_add(varchar(x), bigint, time) , date_add(varchar(x), bigint, time with time zone) , date_add(varchar(x), bigint, timestamp) , date_add(varchar(x), bigint, timestamp with time zone)
However, in date_add('day', -90, "date") I want "date" to be the current row's timestamp, not a static timestamp.
So I made another attempt:
SELECT
unix_date,
merchant,
amount,
AVG(amount)
OVER
( PARTITION BY merchant
ORDER BY unix_date
RANGE BETWEEN INTERVAL '90' DAY PRECEDING AND CURRENT ROW
) AVG_S
FROM ...;
But I get the error message:
SYNTAX_ERROR: line 5:4: Window frame start value type must be INTEGER or BIGINT(actual interval day to second)
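One way around this restriction (a sketch, not verified against the asker's data) is to keep the ORDER BY key numeric: if unix_date holds epoch seconds as a BIGINT, the frame offset can be a plain number, which Presto accepts.

```sql
SELECT
  unix_date,
  merchant,
  amount,
  AVG(amount) OVER (
    PARTITION BY merchant
    ORDER BY unix_date
    -- 7776000 = 90 * 24 * 3600 seconds; a numeric offset satisfies
    -- the "must be INTEGER or BIGINT" frame requirement
    RANGE BETWEEN 7776000 PRECEDING AND CURRENT ROW
  ) AS avg_s
FROM ...;
```

Note that CURRENT ROW includes the row's own amount in the average, unlike the PySpark window above, which ends at -1.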
There is a similar unresolved question here: Presto SQL window aggregate looking back x hours/minutes/seconds
If there is exactly one data point per merchant × date, this is easily done with Presto's window functions:
SELECT
date, merchant, amount,
avg(amount) OVER (
PARTITION BY merchant
ORDER BY date ASC
ROWS 89 PRECEDING) -- 89 preceding rows + the current row
FROM ...
ORDER BY date ASC -- not necessary, but you likely want the data to be sorted as well
If the number of data points per merchant and date varies, you can do this instead (note that the BETWEEN upper bound means the current row's own amount is included in the average):
SELECT
c.merchant, c.date, c.amount, avg(preceding.amount)
FROM your_table c
JOIN your_table preceding ON c.merchant = preceding.merchant
AND preceding.date BETWEEN (c.date - INTERVAL '89' DAY) AND c.date
GROUP BY c.merchant, c.date, c.amount
This worked for me:
CREATE TABLE IF NOT EXISTS full_year_query_parquet
WITH (format = 'PARQUET',
parquet_compression = 'SNAPPY',
external_location='s3://your_s3_bucket/data') AS
SELECT
a.merchant,
a.amount,
a.date,
avg(preceding.amount)
FROM "your_table" as a
JOIN "your_table" as preceding ON a.merchant = preceding.merchant
AND preceding.date > DATE_ADD('day', -90, a.date)
AND preceding.date < a.date
GROUP BY a.merchant, a.amount, a.date
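One caveat: because preceding.date < a.date is strict, a merchant's first transactions have no matching earlier rows and are dropped by the inner join. A LEFT JOIN variant (a sketch, assuming the same table and column names) keeps those rows with a NULL average, matching the PySpark output above:

```sql
SELECT
  a.merchant,
  a.amount,
  a.date,
  -- NULL for rows with no prior transactions in the 90-day window
  avg(preceding.amount) AS average_spending
FROM "your_table" AS a
LEFT JOIN "your_table" AS preceding ON a.merchant = preceding.merchant
  AND preceding.date > DATE_ADD('day', -90, a.date)
  AND preceding.date < a.date
GROUP BY a.merchant, a.amount, a.date
```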