BigQuery SQL solution for comparing rows based on variance
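I am trying to compare scraped retail item price data in BigQuery (about 2-3B rows, depending on the time period and the retailers included), with the intent of identifying meaningful price differences. For example, $1.99 vs. $2.00 is not meaningful, but $1.99 vs. $2.50 is. "Meaningful" is quantified as a 2% difference between prices.
A sample data set for one item looks like this: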
ITEM   Price($)  Meaningful (this is the column I'm trying to flag)
Apple  1.99      Y (lowest price is always flagged)
Apple  2.00      N (1.99 v 2.00)
Apple  2.01      N (1.99 v 2.01) Still using 1.99 for comparison
Apple  2.50      Y (1.99 v 2.50) Still using 1.99 for comparison
Apple  2.56      Y (2.50 v 2.56) Now using 2.50 as new comp. price
Apple  2.62      Y (2.56 v 2.62) Now using 2.56 as new comp. price
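I was hoping to solve this with SQL window functions (LEAD, LAG, PARTITION BY, etc.) by comparing the current row's price with the next row's price. However, that breaks down as soon as I hit a non-meaningful price, because I always want the next value to be compared against the most recent meaningful price (see the $2.50 row above, which is compared against $1.99 and not against $2.01 in the previous row).
My questions:
- Is it possible to solve this in BigQuery with SQL alone? (For example, is there a creative SQL approach I'm overlooking, such as bucketing based on the amount of variance?)
- Since I can't use stored procedures with BQ, what programmatic options do I have? Python/DataFrames in GCP Datalab? BQ UDFs?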
The following works for BigQuery Standard SQL:
#standardSQL
CREATE TEMPORARY FUNCTION x(prices ARRAY<FLOAT64>)
RETURNS ARRAY<STRUCT<price FLOAT64, flag STRING>>
LANGUAGE js AS """
  // prices arrives sorted ascending (see ARRAY_AGG ... ORDER BY below)
  var result = [];
  var last = 0;   // most recent meaningful price
  var flag = '';
  for (var i = 0; i < prices.length; i++) {
    if (i == 0) {
      // the lowest price is always flagged
      last = prices[i];
      flag = 'Y';
    } else if ((prices[i] - last) / last > 0.02) {
      // more than 2% above the last meaningful price:
      // flag it and use it as the new comparison price
      last = prices[i];
      flag = 'Y';
    } else {
      flag = 'N';
    }
    var rec = {};  // plain object, mapped to the output STRUCT
    rec.price = prices[i];
    rec.flag = flag;
    result.push(rec);
  }
  return result;
""";
SELECT item, rec.*
FROM (
  SELECT item, ARRAY_AGG(price ORDER BY price) AS prices
  FROM `yourTable`
  GROUP BY item
), UNNEST(x(prices)) AS rec
-- ORDER BY item, price
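If you want to check the flagging logic before running it against billions of rows, the loop can be pulled out of the UDF and run locally. The following is only a minimal sketch of that idea, assuming Node.js; the name `flagPrices` and the `threshold` parameter are mine and not part of the query above, and the input must already be sorted ascending, as `ARRAY_AGG(price ORDER BY price)` guarantees in the query.

```javascript
// Standalone sketch of the UDF loop (hypothetical helper, not a BigQuery API).
// Assumes `prices` is already sorted in ascending order.
function flagPrices(prices, threshold = 0.02) {
  const result = [];
  let last = null; // most recent meaningful price; null until the first row
  for (const price of prices) {
    let flag = 'N';
    // The first (lowest) price is always flagged; later prices are flagged
    // only when they exceed the last meaningful price by more than threshold.
    if (last === null || (price - last) / last > threshold) {
      last = price;
      flag = 'Y';
    }
    result.push({ price, flag });
  }
  return result;
}

// Sample prices taken from the question.
const flagged = flagPrices([1.99, 2.00, 2.01, 2.50, 2.56, 2.62]);
console.log(flagged.map(r => `${r.price} ${r.flag}`).join('\n'));
```

On the sample prices from the question this yields the flags Y, N, N, Y, Y, Y, which matches the Meaningful column you are trying to produce.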
You can play with and test it using the dummy data from your question:
#standardSQL
CREATE TEMPORARY FUNCTION x(prices ARRAY<FLOAT64>)
RETURNS ARRAY<STRUCT<price FLOAT64, flag STRING>>
LANGUAGE js AS """
  // prices arrives sorted ascending (see ARRAY_AGG ... ORDER BY below)
  var result = [];
  var last = 0;   // most recent meaningful price
  var flag = '';
  for (var i = 0; i < prices.length; i++) {
    if (i == 0) {
      // the lowest price is always flagged
      last = prices[i];
      flag = 'Y';
    } else if ((prices[i] - last) / last > 0.02) {
      // more than 2% above the last meaningful price:
      // flag it and use it as the new comparison price
      last = prices[i];
      flag = 'Y';
    } else {
      flag = 'N';
    }
    var rec = {};  // plain object, mapped to the output STRUCT
    rec.price = prices[i];
    rec.flag = flag;
    result.push(rec);
  }
  return result;
""";
WITH `yourTable` AS (
  SELECT 'Apple' AS item, 1.99 AS price UNION ALL
  SELECT 'Apple', 2.00 UNION ALL
  SELECT 'Apple', 2.01 UNION ALL
  SELECT 'Apple', 2.50 UNION ALL
  SELECT 'Apple', 2.56 UNION ALL
  SELECT 'Apple', 2.62
)
SELECT item, rec.*
FROM (
  SELECT item, ARRAY_AGG(price ORDER BY price) AS prices
  FROM `yourTable`
  GROUP BY item
), UNNEST(x(prices)) AS rec
ORDER BY item, price
The result is as follows:
item price flag
---- ----- ----
Apple 1.99 Y
Apple 2.0 N
Apple 2.01 N
Apple 2.5 Y
Apple 2.56 Y
Apple 2.62 Y
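To illustrate why a plain LAG-style comparison is not enough here, the sketch below contrasts it with the "compare against the last flagged price" rule on prices that creep up by about 1.5% per step. The `creeping` values and both function names are made up for illustration and are not taken from the question's data; this is plain JavaScript meant to be run locally, not inside BigQuery.

```javascript
// Hypothetical prices that rise by roughly 1.5% per step (not from the question).
const creeping = [1.00, 1.015, 1.03, 1.045];

// LAG-style: compare each price with the immediately preceding row.
function flagAdjacent(prices, threshold = 0.02) {
  return prices.map((p, i) =>
    (i === 0 || (p - prices[i - 1]) / prices[i - 1] > threshold) ? 'Y' : 'N');
}

// Anchor-style: compare each price with the most recent flagged price,
// which is the rule the UDF implements.
function flagAnchored(prices, threshold = 0.02) {
  let last = null; // most recent meaningful price
  return prices.map(p => {
    if (last === null || (p - last) / last > threshold) {
      last = p;
      return 'Y';
    }
    return 'N';
  });
}

const adjacentFlags = flagAdjacent(creeping);
const anchoredFlags = flagAnchored(creeping);
console.log(adjacentFlags.join(' ')); // Y N N N
console.log(anchoredFlags.join(' ')); // Y N Y N
```

With adjacent comparison no single step exceeds 2%, so the drift from 1.00 to 1.03 is never flagged. The anchored version catches it because the comparison price only moves when a row is flagged, and that running state is what a single window function call cannot carry from row to row.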