How to optimize a multi-join job
How can I speed up joins over CSV files?
I have a query that joins 8 files:
//please note this is the simplified query
DECLARE ... //summarizing here
FROM ...
USING Extractors.Text(delimiter : '|');
//8 more statements like the above omitted
SELECT one.an_episode_id,
one.id2_enc_kgb_id,
one.suffix,
two.suffixa,
three.suffixThree,
four.suffixFour,
five.suffixFive,
six.suffixSix,
seven.suffixSeven,
eight.suffixEight,
nine.suffixNine,
ten.suffixTen
FROM @one_files AS one
JOIN @two_files AS two
ON one.id3_enc_kgb_id == two.id3_enc_kgb_id
JOIN @three_files AS three
ON three.id3_enc_kgb_id == one.id3_enc_kgb_id
JOIN @four_files AS four
ON four.id3_enc_kgb_id == one.id3_enc_kgb_id
JOIN @five_files AS five
ON five.id3_enc_kgb_id == one.id3_enc_kgb_id
JOIN @six_files AS six
ON six.id2_enc_kgb_id == one.id2_enc_kgb_id
JOIN @seven_files AS seven
ON seven.id2_enc_kgb_id == one.id2_enc_kgb_id
JOIN @eight_files AS eight
ON eight.id2_enc_kgb_id == one.id2_enc_kgb_id
JOIN @nine_files AS nine
ON nine.id2_enc_kgb_id == one.id2_enc_kgb_id
JOIN @ten_files AS ten
ON ten.id2_enc_kgb_id == one.id2_enc_kgb_id;
I submitted the job to Azure, but after a few hours I had to cancel it, at which point it had already cost $80!
As I understand it, Data Lake is supposed to be exactly right for this kind of work?! I probably have 100 files in total, adding up to maybe 20 MB of data.
How can I speed up the joins?
The important thing to note is that small files are suboptimal in every scenario. The solution suggested by Michal Rys for smaller files is to consider these alternative ways of concatenating them into large files:
- offline, outside of Azure
- Event Hubs Capture
- Stream Analytics
- or ADLA fast file sets to compact recent deltas
Note: a fast file set allows you to consume hundreds of thousands of such files in bulk in a single EXTRACT.
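For illustration, a file-set EXTRACT might look like the sketch below. The path and column list are hypothetical placeholders, not taken from the OP's data, and I have not run this exact statement:

```
// One EXTRACT over a file-set pattern: {filename} matches every CSV under
// /input/ and also becomes a virtual string column, so hundreds of small
// files land in a single rowset instead of one statement per file.
@allFiles =
    EXTRACT an_episode_id string,
            id2_enc_kgb_id string,
            suffix string,
            filename string // virtual column filled in from the path pattern
    FROM "/input/{filename}.csv"
    USING Extractors.Text(delimiter : '|');
```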
I would use INNER JOIN instead of the bare JOIN, to be sure you know which join you are really using.
It is very important to look at how you extract the information from the CSV files. The JOINed result should be output to a TSV (Tab-Separated-Values) file. (Not to be confused with TVF, Table-Valued Functions, which are used for U-SQL code reuse.)
TSV structure:
- TSV = Tab-Separated-Values
- it has no header row
- every row has the same number of columns
This format should be very efficient for U-SQL (I have not measured it myself).
For the complete picture, there are three different built-in outputter types you can use: .Text(), .Csv(), .Tsv().
Your example is missing the variables, so I will try to guess them:
USE DATABASE <your_database>;
USE SCHEMA <your_schema>;
DECLARE @FirstCsvFile string = "/<path>/first.csv";
@firstFile = EXTRACT an_episode_id string, id2_enc_kgb_id string, suffix string
FROM @FirstCsvFile USING Extractors.Text(delimiter : '|');
// probably 8 more statements that were omitted in the OP
@encode = SELECT one.an_episode_id,
one.id2_enc_kgb_id,
one.suffix,
two.suffixa,
three.suffixThree,
four.suffixFour,
five.suffixFive,
six.suffixSix,
seven.suffixSeven,
eight.suffixEight,
nine.suffixNine,
ten.suffixTen
FROM @firstFile AS one
INNER JOIN @two_files AS two
ON one.id3_enc_kgb_id == two.id3_enc_kgb_id
INNER JOIN @three_files AS three
ON three.id3_enc_kgb_id == one.id3_enc_kgb_id
INNER JOIN @four_files AS four
ON four.id3_enc_kgb_id == one.id3_enc_kgb_id
INNER JOIN @five_files AS five
ON five.id3_enc_kgb_id == one.id3_enc_kgb_id
INNER JOIN @six_files AS six
ON six.id2_enc_kgb_id == one.id2_enc_kgb_id
INNER JOIN @seven_files AS seven
ON seven.id2_enc_kgb_id == one.id2_enc_kgb_id
INNER JOIN @eight_files AS eight
ON eight.id2_enc_kgb_id == one.id2_enc_kgb_id
INNER JOIN @nine_files AS nine
ON nine.id2_enc_kgb_id == one.id2_enc_kgb_id
INNER JOIN @ten_files AS ten
ON ten.id2_enc_kgb_id == one.id2_enc_kgb_id;
OUTPUT @encode TO "/outputs/encode_joins.tsv" USING Outputters.Tsv();