U-SQL - How to select all rows between two string rows in U-SQL

Question

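Here is my full task description:

I have to use U-SQL to extract data from multiple files and output it to CSV files. Every input file contains multiple reports, delimited by specific string rows ("START OF ..." and "END OF ..." act as the report separators). Here is an example (data format) of a single source (input) file: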

START OF DAILY ACCOUNT
some data 1
some data 2
some data 3
some data n
END OF DAILY ACCOUNT
START OF LEDGER BALANCE
some data 1
some data 2
some data 3
some data 4
some data 5
some data n
END OF LEDGER BALANCE
START OF DAILY SUMMARY REPORT
some data 1
some data 2
some data 3
some data n
END OF DAILY SUMMARY REPORT

So my question is: how do I get the records between the "START OF ..." and "END OF ..." rows for all files?

In the end I want something like this:

@dailyAccountResult = [select all rows between "START OF DAILY ACCOUNT" and "END OF DAILY ACCOUNT" rows]

@ledgerBalanceResult = [select all rows between "START OF LEDGER BALANCE" and "END OF LEDGER BALANCE" rows]

@dailySummaryReportResult = [select all rows between "START OF DAILY SUMMARY REPORT" and "END OF DAILY SUMMARY REPORT" rows]

Do I need to write a custom extractor for this? If so, please tell me how.

Answer 1

I think this can be done with plain U-SQL, without a custom extractor. I created a simple example based on your sample data:

// Get raw input
@input =
    EXTRACT rawData string
    FROM "/input/input36.txt"
    USING Extractors.Tsv();


// Add a row number and break out the section name.
// Get all [START OF ...] and [END OF ...] rows and pair them.
// !!WARNING: the code assumes there are no duplicate sections, i.e. there cannot be more than one DAILY ACCOUNT section, for example.
@working =
    SELECT ROW_NUMBER() OVER() AS rn,
           System.Text.RegularExpressions.Regex.Match(rawData, "(START OF|END OF) (?<sectionName>.+)").Groups["sectionName"].ToString() AS sectionName,
           *
    FROM @input;


// Work out the section boundaries
@sections =
    SELECT sectionName,
           MIN(rn) AS startRn,
           MAX(rn) AS endRn,
           COUNT( * ) AS records
    FROM @working
    WHERE sectionName != ""
    GROUP BY sectionName;


// Create the output
@output =
    SELECT s.sectionName,
           i.rn == s.startRn ? 1 : 0 AS isStartSection,
           i.rn == s.endRn ? 1 : 0 AS isEndSection,
           i.rawData
    FROM @sections AS s
         CROSS JOIN
             @working AS i
    WHERE i.rn BETWEEN s.startRn AND s.endRn;


// Output the file
OUTPUT @output
TO "/output/output.txt"
USING Outputters.Tsv(quoting : false);

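In the U-SQL output, each row is now tagged with its section name, so you can easily assign the data to different variables and optionally include the header/footer rows, for example: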

@dailyAccount =
    SELECT rawData
    FROM @output
    WHERE sectionName == "DAILY ACCOUNT"
          AND isStartSection == 0
          AND isEndSection == 0;
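
The steps of the U-SQL script can be mirrored in plain Python to sanity-check the logic on a small sample before submitting a job. This is only a sketch: the names `tag_rows` and `section_body` and the shortened sample are made up for illustration, and it shares the script's assumption that each section name occurs at most once.

```python
import re

# Shortened, hypothetical sample in the same format as the input file.
SAMPLE = """START OF DAILY ACCOUNT
some data 1
some data 2
END OF DAILY ACCOUNT
START OF LEDGER BALANCE
some data 1
some data 2
some data 3
END OF LEDGER BALANCE"""

# Same pattern as in the U-SQL script.
MARKER = re.compile(r"(START OF|END OF) (?P<sectionName>.+)")


def tag_rows(lines):
    """Return (sectionName, isStartSection, isEndSection, rawData) tuples."""
    # Step 1 (@working): number the rows and pull out the section name.
    working = []
    for rn, raw in enumerate(lines, start=1):
        m = MARKER.search(raw)
        working.append((rn, m.group("sectionName") if m else "", raw))

    # Step 2 (@sections): lowest and highest row number per section name.
    bounds = {}
    for rn, name, _ in working:
        if name:
            start, end = bounds.get(name, (rn, rn))
            bounds[name] = (min(start, rn), max(end, rn))

    # Step 3 (@output): keep every row that falls inside a section's range.
    output = []
    for name, (start, end) in bounds.items():
        for rn, _, raw in working:
            if start <= rn <= end:
                output.append((name, int(rn == start), int(rn == end), raw))
    return output


def section_body(output, name):
    """Rows of one section, without its START/END marker rows."""
    return [raw for sec, is_start, is_end, raw in output
            if sec == name and not is_start and not is_end]


if __name__ == "__main__":
    tagged = tag_rows(SAMPLE.splitlines())
    for row in section_body(tagged, "DAILY ACCOUNT"):
        print(row)
```

Calling `section_body` with "LEDGER BALANCE" or "DAILY SUMMARY REPORT" corresponds to changing the `sectionName` filter in the U-SQL query above.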

Give it a try and let me know how you get on.
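
The WARNING in the script notes that repeated section names are not supported, because taking MIN/MAX of the row numbers would merge the repeats into one range. If that matters, a single pass that tracks the currently open section avoids the problem; it is also roughly the logic a custom extractor would need. The Python sketch below only illustrates the idea: `split_reports` is a hypothetical name, and lines outside any section are assumed to be safe to ignore.

```python
import re
from collections import defaultdict

START = re.compile(r"^START OF (.+)$")
END = re.compile(r"^END OF (.+)$")


def split_reports(lines):
    """Map each section name to a list of blocks; a block is a list of data rows."""
    reports = defaultdict(list)
    current = None  # name of the section being read, or None between sections
    block = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        start = START.match(line)
        end = END.match(line)
        if start:
            # A new START opens a fresh block; an unclosed previous block is dropped.
            current, block = start.group(1), []
        elif end and current == end.group(1):
            # A matching END closes the block and stores it under its section name.
            reports[current].append(block)
            current = None
        elif current is not None:
            block.append(line)
    return dict(reports)


if __name__ == "__main__":
    demo = [
        "START OF DAILY ACCOUNT", "some data 1", "END OF DAILY ACCOUNT",
        "START OF DAILY ACCOUNT", "some data 2", "END OF DAILY ACCOUNT",
    ]
    print(split_reports(demo))
```

Because each START/END pair produces its own block, two DAILY ACCOUNT reports in the same file stay separate instead of being merged.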


c#

azure-sql-database

azure-data-lake

u-sql