U-SQL JOIN with BETWEEN to create monthly results
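Edit:
I have been working on the desired result. Let me explain it better: what I am trying to achieve is one row for each SomeData per YearMonth between the start date and the end date.
For example, SomeData "88888888888888888881" with start date "2005-12-06 00:00:00.000" and end date "2006-03-13 00:00:00.000".
I want rows like this: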
88888888888888888881, 200512
88888888888888888881, 200601
88888888888888888881, 200602
88888888888888888881, 200603
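I know this may "explode" the result into a huge file.
Below is my original post:
I am trying to rewrite in U-SQL something we previously did in T-SQL.
The problem is that U-SQL does not allow BETWEEN (a range predicate) in a join condition.
The T-SQL JOIN looks like this: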
SELECT rf.SomeData AS SomeData,
rd.YearMonth AS YearMonth,
(rf.SomeData + '-' + rd.YearMonth.ToString()) AS MonthlyKey,
rf.SomeKey AS SomeKey
FROM MyTable rf
INNER JOIN dbo.DimDate rd
ON rd.Date >= rf.StartDate
AND rd.Date <= (CASE WHEN rf.EndDate IS NULL THEN GETDATE() ELSE rf.EndDate END)
In U-SQL I started like this. How should the JOIN be written now?
@EditedTable =
SELECT rf.SomeData AS SomeData,
rd.YearMonth AS YearMonth,
(rf.SomeData + "-" + rd.YearMonth.ToString()) AS MonthlyKey,
rf.SomeKey AS SomeKey
FROM @MyTable AS rf
INNER JOIN
@date AS rd
ON
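It is important that we get all the data between the start date and the end date and create the monthly key, so that "SomeData" can later be joined with another table.
I tried using a CROSS JOIN, but when I run it, it gets stuck at 80% and never seems to finish. It keeps writing gigabytes in a single vertex. Besides, I am not actually sure it would produce the same result.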
@EditedTableCROSS =
SELECT rfj.SomeData AS SomeData,
rfj.StartDate AS StartDate,
rfj.EndDate AS EndDate,
(rfj.SomeData + "-" + dtj.YearMonth.ToString()) AS MonthlyKey,
rfj.SomeKey AS SomeKey
FROM
(
SELECT SomeData AS SomeData,
StartDate AS StartDate,
EndDate AS EndDate,
SomeKey AS SomeKey
FROM @TableA
WHERE SomeData != ""
) AS rfj
CROSS JOIN
(
SELECT DISTINCT
dt.Date AS Date,
dt.YearMonth AS YearMonth,
dt.Month AS Month,
rf.StartDate AS StartDate
FROM @date AS dt INNER JOIN @TableA AS rf ON rf.StartDate == dt.Date
WHERE rf.StartDate >= dt.Date AND
dt.Date <= DateTime.Now
) AS dtj
WHERE rfj.StartDate <= dtj.Date AND
rfj.EndDate >= dtj.Date;
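The problem with the code above is that "INNER JOIN @TableA AS rf ON rf.StartDate == dt.Date" is not on a unique key; some dates occur more than once. So I doubt this is the right approach.
Please share your thoughts.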
Edit:
People asked for sample data. The end dates may contain:
2006-03-13 10:27:13.000
2016-03-02 18:48:11.000
2016-03-02 18:42:57.000
NULL
2013-09-12 09:19:05.000
NULL
2016-03-02 18:59:37.000
NULL
NULL
Start dates:
2005-12-06 00:00:00.000
2011-03-29 20:57:51.000
2007-11-01 00:00:00.000
2007-11-01 00:00:00.000
2007-11-01 00:00:00.000
2011-02-28 00:00:00.000
2011-02-28 00:00:00.000
2011-02-28 00:00:00.000
2008-01-17 00:00:00.000
DimDate contains day-level dates from 2000 to 2018.
SomeData and SomeKey look like:
88888888888888888881
88888888888888888882
88888888888888888883
88888888888888888884
88888888888888888885
88888888888888888886
88888888888888888887
88888888888888888888
88888888888888888889
I got this script to work on some sample data I generated.
@dateDim =
EXTRACT xdate DateTime,
yearMonth string
FROM "/input/dbo.DimDate.tsv"
USING Extractors.Tsv();
@data =
EXTRACT
someKey int,
someData string,
startDate DateTime,
endDate DateTime?
FROM "/input/dbo.MyTable.tsv"
USING Extractors.Tsv();
/*
// Use U-SQL ISNULL conditional operator which is ?
@working =
SELECT COUNT( * ) AS records
FROM
(
SELECT *
FROM @dateDim AS dd
CROSS JOIN
@data AS d
WHERE dd.xdate BETWEEN d.startDate AND (d.endDate == (DateTime?)null ? DateTime.Now : d.endDate)
) AS x;
*/
@working =
SELECT COUNT( * ) AS records
FROM
(
SELECT *
FROM @dateDim AS dd
CROSS JOIN
@data AS d
WHERE dd.xdate >= d.startDate
AND dd.xdate <= (d.endDate == (DateTime?)null ? DateTime.Now : d.endDate)
) AS x;
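The reason U-SQL does not support BETWEEN in the join predicate is that there is no scale-out join algorithm for non-equi joins. Even if we allowed it syntactically, it would still end up as a CROSS JOIN in the plan.
What you want is a join that can be partitioned. One way to get there is an equality join on a partitioning key, followed by a cross join within each partition.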
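A minimal sketch of such a partitioned join, assuming the `@dateDim` and `@data` rowsets from the script earlier in the thread. The `yr` and `rangeEnd` columns and the rowset names `@dateDimKeyed`, `@dataRanges`, `@dataByYear` and `@joined` are invented here for illustration, and the calendar year is only one possible partitioning key.

```usql
// Hypothetical partition key: the calendar year of each date in the dimension.
@dateDimKeyed =
    SELECT xdate,
           yearMonth,
           xdate.Year AS yr
    FROM @dateDim;

// Resolve open-ended ranges once (a NULL endDate is treated as "until now").
@dataRanges =
    SELECT someKey,
           someData,
           startDate,
           (endDate == (DateTime?) null ? DateTime.Now : endDate.Value) AS rangeEnd
    FROM @data;

// One row per data row and per year its range touches.
// Math.Max keeps the count non-negative if rangeEnd precedes startDate.
@dataByYear =
    SELECT someKey, someData, startDate, rangeEnd, yr
    FROM @dataRanges
         CROSS APPLY EXPLODE
         (Enumerable.Range(startDate.Year,
                           Math.Max(0, 1 + rangeEnd.Year - startDate.Year))
         ) AS Y(yr);

// Equality join on the year, with the range check applied as a filter afterwards.
@joined =
    SELECT d.someData AS someData,
           dd.yearMonth AS yearMonth,
           d.someData + "-" + dd.yearMonth AS monthlyKey,
           d.someKey AS someKey
    FROM @dataByYear AS d
         INNER JOIN
         @dateDimKeyed AS dd
         ON d.yr == dd.yr
    WHERE dd.xdate >= d.startDate
          AND dd.xdate <= d.rangeEnd;
```

Because the date dimension is day-level, this still returns one row per day, as the original T-SQL did; a `SELECT DISTINCT` over `someData`, `yearMonth` and `someKey` would reduce it to one row per month.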
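However, in your case I don't think you really need a join. I think what you want is to generate one row per day between the start date and the end date.
I would do that with CROSS APPLY EXPLODE, which has no scale limitation. Here is an example: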
@MyTable =
SELECT *
FROM (VALUES
(81,81,(DateTime?) DateTime.Parse("2005-12-06 00:00:00.000"),(DateTime?) DateTime.Parse("2006-03-13 10:27:13.000")),
(82,82,(DateTime?) DateTime.Parse("2011-03-29 20:57:51.000"),(DateTime?) DateTime.Parse("2016-03-02 18:48:11.000")),
(83,83,(DateTime?) DateTime.Parse("2007-11-01 00:00:00.000"),(DateTime?) DateTime.Parse("2016-03-02 18:42:57.000")),
(84,84,(DateTime?) DateTime.Parse("2007-11-01 00:00:00.000"),(DateTime?) null),
(85,85,(DateTime?) DateTime.Parse("2007-11-01 00:00:00.000"),(DateTime?) DateTime.Parse("2013-09-12 09:19:05.000")),
(86,86,(DateTime?) DateTime.Parse("2011-02-28 00:00:00.000"),(DateTime?) null),
(87,87,(DateTime?) DateTime.Parse("2011-02-28 00:00:00.000"),(DateTime?) DateTime.Parse("2016-03-02 18:59:37.000")),
(88,88,(DateTime?) DateTime.Parse("2011-02-28 00:00:00.000"),(DateTime?) null),
(89,89,(DateTime?) DateTime.Parse("2008-01-17 00:00:00.000"),(DateTime?) null)
) AS T(SomeKey, SomeData, StartDate, EndDate);
@res =
SELECT SomeKey, SomeData, StartDate, EndDate, DailyDate
FROM @MyTable
CROSS APPLY EXPLODE
(Enumerable.Range(0,
1 + (EndDate == (DateTime?) null ? DateTime.Now
: EndDate.Value).Subtract(StartDate.Value).Days)
.Select(offset => StartDate.Value.AddDays(offset))
) AS T(DailyDate);
OUTPUT @res
TO "/output/test.csv"
USING Outputters.Csv(outputHeader : true);
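The question's edit asks for one row per YearMonth rather than per day. The same EXPLODE pattern can probably be adapted by enumerating month offsets instead of day offsets. The following is a sketch under that assumption, reusing `@MyTable` from above; the names `@ranges`, `@monthly`, `RangeStart`, `RangeEnd` and `MonthStart` are made up for illustration.

```usql
// Resolve the NULL end dates once so the EXPLODE expression stays readable.
@ranges =
    SELECT SomeKey,
           SomeData,
           StartDate.Value AS RangeStart,
           (EndDate == (DateTime?) null ? DateTime.Now : EndDate.Value) AS RangeEnd
    FROM @MyTable;

// One row per calendar month touched by [RangeStart, RangeEnd].
// The month count is 1 + the number of month boundaries crossed;
// Math.Max keeps it non-negative if RangeEnd precedes RangeStart.
@monthly =
    SELECT SomeKey,
           SomeData,
           MonthStart.ToString("yyyyMM") AS YearMonth,
           SomeData.ToString() + "-" + MonthStart.ToString("yyyyMM") AS MonthlyKey
    FROM @ranges
         CROSS APPLY EXPLODE
         (Enumerable.Range(0,
              Math.Max(0, 1 + 12 * (RangeEnd.Year - RangeStart.Year)
                            + RangeEnd.Month - RangeStart.Month))
          .Select(offset => new DateTime(RangeStart.Year, RangeStart.Month, 1).AddMonths(offset))
         ) AS T(MonthStart);
```

For the row starting 2005-12-06 and ending 2006-03-13 the count is 1 + 12 * 1 + 3 - 12 = 4, giving 200512, 200601, 200602 and 200603, which matches the rows requested in the question.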
This is a typical example of why a question framed around the scenario is easier to answer than a request for a translation :).