Merging two CSV files in Azure Data Factory by using a custom .NET activity

I have two CSV files, each containing many columns (n columns). I need to merge them into a single CSV file that has the unique columns from both input files.

I have gone through all the blogs and sites I could find, and all of them point to using a custom .NET activity. So I went through this site

But I still can't figure out the C# coding part. Can anyone share code showing how to merge these two CSV files using a custom .NET activity in Azure Data Factory?

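Here is an example of how to join two tab-separated files on the ZIP_Code column using U-SQL. The example assumes both files are stored in Azure Data Lake Storage (ADLS). The script can easily be incorporated into a Data Factory pipeline: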

// Get raw input from file A
@inputA =
    EXTRACT 
        Date_received   string,
        Product string,
        Sub_product string,
        Issue   string,
        Sub_issue   string,
        Consumer_complaint_narrative    string,
        Company_public_response string,
        Company string,
        State   string,
        ZIP_Code    string,
        Tags    string,
        Consumer_consent_provided   string,
        Submitted_via   string,
        Date_sent_to_company    string,
        Company_response_to_consumer    string,
        Timely_response string,
        Consumer_disputed   string,
        Complaint_ID    string

    FROM "/input/input48A.txt"
    USING Extractors.Tsv();


// Get raw input from file B
@inputB =
    EXTRACT Provider_ID string,
            Hospital_Name string,
            Address string,
            City string,
            State string,
            ZIP_Code string,
            County_Name string,
            Phone_Number string,
            Hospital_Type string,
            Hospital_Ownership string,
            Emergency_Services string,
            Meets_criteria_for_meaningful_use_of_EHRs string,
            Hospital_overall_rating string,
            Hospital_overall_rating_footnote string,
            Mortality_national_comparison string,
            Mortality_national_comparison_footnote string,
            Safety_of_care_national_comparison string,
            Safety_of_care_national_comparison_footnote string,
            Readmission_national_comparison string,
            Readmission_national_comparison_footnote string,
            Patient_experience_national_comparison string,
            Patient_experience_national_comparison_footnote string,
            Effectiveness_of_care_national_comparison string,
            Effectiveness_of_care_national_comparison_footnote string,
            Timeliness_of_care_national_comparison string,
            Timeliness_of_care_national_comparison_footnote string,
            Efficient_use_of_medical_imaging_national_comparison string,
            Efficient_use_of_medical_imaging_national_comparison_footnote string,
            Location string

    FROM "/input/input48B.txt"
    USING Extractors.Tsv();


// Join the two files on the ZIP_Code column, keeping only one ZIP code
@output =
    SELECT b.Provider_ID,
           b.Hospital_Name,
           b.Address,
           b.City,
           b.State,
           b.ZIP_Code,
           a.Complaint_ID

    FROM @inputA AS a
         INNER JOIN
             @inputB AS b
         ON a.ZIP_Code == b.ZIP_Code
    WHERE a.ZIP_Code == "36033";


// Output the file
OUTPUT @output
TO "/output/output.txt"
USING Outputters.Tsv(quoting : false);

This could also be converted into a U-SQL stored procedure that takes the file names and the ZIP code as parameters.

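Of course there are many ways to achieve this, each with its own pros and cons. For example, a custom .NET activity may feel more comfortable to someone with a .NET background, but you will need some compute to run it. Importing the files into an Azure SQL Database is a good option for someone with a SQL / database background who already has an Azure SQL Database subscription.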
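For anyone who would rather see the join logic outside U-SQL, below is a minimal Python sketch of the same kind of inner join. It is only an illustration, not the U-SQL or Data Factory implementation. Assumptions: both inputs carry a header row (the U-SQL script instead names the columns in EXTRACT), the data fits in memory, and `join_on_key` and the sample rows are hypothetical names and values made up for this example. The output lists each column name once, which is roughly the "unique columns from both files" result the question describes.

```python
import csv
import io


def join_on_key(text_a, text_b, key, delimiter="\t"):
    """Inner-join two delimited texts (each with a header row) on a shared column.

    The returned header holds every column name once: all columns of A,
    followed by the columns of B that A does not already have.
    """
    reader_a = csv.DictReader(io.StringIO(text_a), delimiter=delimiter)
    reader_b = csv.DictReader(io.StringIO(text_b), delimiter=delimiter)
    rows_a = list(reader_a)
    rows_b = list(reader_b)

    columns = list(reader_a.fieldnames)
    columns += [c for c in reader_b.fieldnames if c not in columns]

    # Index B by the join key so each row of A can look up its matches.
    index_b = {}
    for row in rows_b:
        index_b.setdefault(row[key], []).append(row)

    joined = []
    for row_a in rows_a:
        for row_b in index_b.get(row_a[key], []):
            merged = dict(row_b)
            merged.update(row_a)  # on a column-name clash, A's value wins
            joined.append([merged[c] for c in columns])
    return columns, joined


# Hypothetical sample data, tab-separated like the files in the answer.
file_a = "Complaint_ID\tState\tZIP_Code\n1\tAL\t36033\n2\tNY\t10001\n"
file_b = "Provider_ID\tHospital_Name\tState\tZIP_Code\n10\tGeneral\tAL\t36033\n"

columns, rows = join_on_key(file_a, file_b, "ZIP_Code")
writer = csv.writer(io.StringIO(), delimiter="\t")
print(columns)
for row in rows:
    print(row)
```

The same function handles comma-separated input by passing `delimiter=","`. When both files share a non-key column name (State in the sample), this sketch keeps file A's value; a real pipeline would more likely rename one of the two columns so that neither is lost.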