How to optimize a multi-join job

How can I speed up joins over CSV files?

I have a query that joins 8 files:

//please note this is the simplified query
DECLARE ... //summarizing here
FROM ...
USING Extractors.Text(delimiter : '|');

//8 more statements like the above omitted

SELECT  one.an_episode_id, 
        one.id2_enc_kgb_id, 
        one.suffix, 
        two.suffixa, 
        three.suffixThree, 
        four.suffixFour, 
        five.suffixFive,
        six.suffixSix,
        seven.suffixSeven,
        eight.suffixEight,
        nine.suffixNine,
        ten.suffixTen  
FROM @one_files AS one
JOIN @two_files AS two 
  ON one.id3_enc_kgb_id == two.id3_enc_kgb_id
JOIN @three_files AS three  
  ON three.id3_enc_kgb_id == one.id3_enc_kgb_id
JOIN @four_files AS four
  ON four.id3_enc_kgb_id == one.id3_enc_kgb_id
JOIN @five_files AS five
  ON five.id3_enc_kgb_id == one.id3_enc_kgb_id
JOIN @six_files AS six
  ON six.id2_enc_kgb_id == one.id2_enc_kgb_id
JOIN @seven_files AS seven
  ON seven.id2_enc_kgb_id == one.id2_enc_kgb_id
JOIN @eight_files AS eight
  ON eight.id2_enc_kgb_id == one.id2_enc_kgb_id
JOIN @nine_files AS nine
  ON nine.id2_enc_kgb_id == one.id2_enc_kgb_id
JOIN @ten_files AS ten
  ON ten.id2_enc_kgb_id == one.id2_enc_kgb_id;

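I submitted the job to Azure, but had to cancel it after a few hours, at a cost of $80!

As I understand it, Data Lake is meant for exactly this kind of job?! I have maybe 100 files in total, and maybe 20 MB of data altogether.

How can I speed up the joins?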

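The important thing to be aware of is that small files are suboptimal in every scenario. Michael Rys's suggested solution for small files is to consider one of these alternatives for concatenating them into larger files:

  • Offline, outside of Azure
  • Event Hubs Capture
  • Stream Analytics
  • or ADLA fast file sets, to compact recent deltas

Note: a fast file set lets you consume hundreds of thousands of such files in bulk in a single EXTRACT.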
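As a rough sketch of the file set idea, a single EXTRACT can read every file that matches a path pattern, and the result can then be written back out as one larger file. The folder layout, the output path, the `file_name` virtual column, and the column list are all assumptions made for illustration, since the original question does not show them:

```usql
// Hypothetical layout: /<path>/one/ holds many small pipe-delimited files.
// Column names and types are guesses based on the query in the question.
@one_files =
    EXTRACT an_episode_id  string,
            id2_enc_kgb_id string,
            id3_enc_kgb_id string,
            suffix         string,
            file_name      string   // virtual column, filled from the {file_name} part of the path
    FROM "/<path>/one/{file_name}.csv"
    USING Extractors.Text(delimiter : '|');

// Drop the virtual column and write everything back as one larger file.
@compacted =
    SELECT an_episode_id,
           id2_enc_kgb_id,
           id3_enc_kgb_id,
           suffix
    FROM @one_files;

OUTPUT @compacted
TO "/<path>/compacted/one.csv"
USING Outputters.Text(delimiter : '|');
```

Later jobs could then read the compacted file instead of the many small ones.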

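I would write INNER JOIN instead of JOIN, so that it is explicit which join you are actually using.

It is also important to look at how you extract the information from the CSV files. The JOINed result should be written to a TSV (Tab-Separated-Value) file. Note: TSV is not the same thing as TVF, which stands for Table-Valued Function and is used for U-SQL code reuse.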

TSV structure:

  • TSV = Tab-Separated-Value
  • It has no header row
  • Every row has the same number of columns

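This format should be very efficient for U-SQL (I have not measured it myself).

For completeness, there are three built-in outputter types you can use: Outputters.Text(), Outputters.Csv(), and Outputters.Tsv().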
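A minimal sketch of the three outputters side by side, assuming a tiny inline rowset and hypothetical output paths (neither comes from the original question). As far as I know, Csv() and Tsv() behave like Text() with the delimiter preset to a comma and a tab respectively:

```usql
// Tiny inline rowset, only here so the OUTPUT statements have something to write.
@rows =
    SELECT *
    FROM (VALUES ("a", 1), ("b", 2)) AS T(name, n);

// Same rowset, three built-in outputters (the paths are placeholders).
OUTPUT @rows TO "/outputs/sample.txt" USING Outputters.Text(delimiter : '|');
OUTPUT @rows TO "/outputs/sample.csv" USING Outputters.Csv();
OUTPUT @rows TO "/outputs/sample.tsv" USING Outputters.Tsv();
```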

Your example is missing the variables, so I will try to guess them:

USE DATABASE <your_database>;
USE SCHEMA <your_schema>;

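DECLARE @FirstCsvFile string = "/<path>/first.csv";

// Column names, types, and order are guesses; id3_enc_kgb_id is included because the joins below use it
@firstFile =
    EXTRACT an_episode_id string,
            id2_enc_kgb_id string,
            id3_enc_kgb_id string,
            suffix string
    FROM @FirstCsvFile
    USING Extractors.Text(delimiter : '|');

// probably 8 more statements, which were omitted in the OP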


@encode = SELECT  one.an_episode_id, 
                  one.id2_enc_kgb_id, 
                  one.suffix, 
                  two.suffixa, 
                  three.suffixThree, 
                  four.suffixFour, 
                  five.suffixFive,
                  six.suffixSix,
                  seven.suffixSeven,
                  eight.suffixEight,
                  nine.suffixNine,
                  ten.suffixTen  
          FROM @firstFile AS one
          INNER JOIN @two_files AS two 
            ON one.id3_enc_kgb_id == two.id3_enc_kgb_id
          INNER JOIN @three_files AS three  
            ON three.id3_enc_kgb_id == one.id3_enc_kgb_id
          INNER JOIN @four_files AS four
            ON four.id3_enc_kgb_id == one.id3_enc_kgb_id
          INNER JOIN @five_files AS five
            ON five.id3_enc_kgb_id == one.id3_enc_kgb_id
          INNER JOIN @six_files AS six
            ON six.id2_enc_kgb_id == one.id2_enc_kgb_id
          INNER JOIN @seven_files AS seven
            ON seven.id2_enc_kgb_id == one.id2_enc_kgb_id
          INNER JOIN @eight_files AS eight
            ON eight.id2_enc_kgb_id == one.id2_enc_kgb_id
          INNER JOIN @nine_files AS nine
            ON nine.id2_enc_kgb_id == one.id2_enc_kgb_id
          INNER JOIN @ten_files AS ten
            ON ten.id2_enc_kgb_id == one.id2_enc_kgb_id;
// Write the joined rowset as a header-less TSV file
OUTPUT @encode
TO "/outputs/encode_joins.tsv"
USING Outputters.Tsv();