Merging two CSV files in Azure Data Factory by using a custom .NET activity
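I have two CSV files, each containing many columns (n columns). I need to merge these two CSV files into a single CSV file that has the unique columns from both input files.
I have gone through all the blogs and sites I could find. They all point to using a custom .NET activity, so I went through this site,
but I still cannot figure out the C# coding part. Can anyone share code showing how to merge these two CSV files using a custom .NET activity in Azure Data Factory?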
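Here is an example of how to join two tab-separated files on the ZIP_Code column using U-SQL. This example assumes both files are held in Azure Data Lake Storage (ADLS). The script could easily be incorporated into a Data Factory pipeline: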
// Get raw input from file A
@inputA =
    EXTRACT
        Date_received string,
        Product string,
        Sub_product string,
        Issue string,
        Sub_issue string,
        Consumer_complaint_narrative string,
        Company_public_response string,
        Company string,
        State string,
        ZIP_Code string,
        Tags string,
        Consumer_consent_provided string,
        Submitted_via string,
        Date_sent_to_company string,
        Company_response_to_consumer string,
        Timely_response string,
        Consumer_disputed string,
        Complaint_ID string
    FROM "/input/input48A.txt"
    USING Extractors.Tsv();

// Get raw input from file B
@inputB =
    EXTRACT
        Provider_ID string,
        Hospital_Name string,
        Address string,
        City string,
        State string,
        ZIP_Code string,
        County_Name string,
        Phone_Number string,
        Hospital_Type string,
        Hospital_Ownership string,
        Emergency_Services string,
        Meets_criteria_for_meaningful_use_of_EHRs string,
        Hospital_overall_rating string,
        Hospital_overall_rating_footnote string,
        Mortality_national_comparison string,
        Mortality_national_comparison_footnote string,
        Safety_of_care_national_comparison string,
        Safety_of_care_national_comparison_footnote string,
        Readmission_national_comparison string,
        Readmission_national_comparison_footnote string,
        Patient_experience_national_comparison string,
        Patient_experience_national_comparison_footnote string,
        Effectiveness_of_care_national_comparison string,
        Effectiveness_of_care_national_comparison_footnote string,
        Timeliness_of_care_national_comparison string,
        Timeliness_of_care_national_comparison_footnote string,
        Efficient_use_of_medical_imaging_national_comparison string,
        Efficient_use_of_medical_imaging_national_comparison_footnote string,
        Location string
    FROM "/input/input48B.txt"
    USING Extractors.Tsv();

// Join the two files on the ZIP_Code column
@output =
    SELECT b.Provider_ID,
           b.Hospital_Name,
           b.Address,
           b.City,
           b.State,
           b.ZIP_Code,
           a.Complaint_ID
    FROM @inputA AS a
         INNER JOIN
             @inputB AS b
         ON a.ZIP_Code == b.ZIP_Code
    WHERE a.ZIP_Code == "36033";

// Output the file
OUTPUT @output
TO "/output/output.txt"
USING Outputters.Tsv(quoting : false);
This could also be converted into a U-SQL stored procedure with parameters for the file names and the zip code.
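As a rough illustration of that idea, the script above could be wrapped in a procedure along the following lines. This is a minimal sketch and has not been run against a live Data Lake Analytics account. The procedure name `JoinComplaintsToHospitals` and the parameter names `@fileA`, `@fileB`, `@outputFile` and `@zipCode` are hypothetical, and placing the procedure in the `master` database and `dbo` schema is an assumption. The column lists are copied unchanged from the script above.

```usql
// Sketch only: procedure name, parameter names and the master.dbo location are assumptions.
DROP PROCEDURE IF EXISTS master.dbo.JoinComplaintsToHospitals;

CREATE PROCEDURE master.dbo.JoinComplaintsToHospitals
(
    @fileA string,
    @fileB string,
    @outputFile string,
    @zipCode string
)
AS
BEGIN
    // Get raw input from file A (path supplied by the caller)
    @inputA =
        EXTRACT Date_received string, Product string, Sub_product string,
                Issue string, Sub_issue string,
                Consumer_complaint_narrative string,
                Company_public_response string, Company string,
                State string, ZIP_Code string, Tags string,
                Consumer_consent_provided string, Submitted_via string,
                Date_sent_to_company string,
                Company_response_to_consumer string,
                Timely_response string, Consumer_disputed string,
                Complaint_ID string
        FROM @fileA
        USING Extractors.Tsv();

    // Get raw input from file B (path supplied by the caller)
    @inputB =
        EXTRACT Provider_ID string, Hospital_Name string, Address string,
                City string, State string, ZIP_Code string,
                County_Name string, Phone_Number string,
                Hospital_Type string, Hospital_Ownership string,
                Emergency_Services string,
                Meets_criteria_for_meaningful_use_of_EHRs string,
                Hospital_overall_rating string,
                Hospital_overall_rating_footnote string,
                Mortality_national_comparison string,
                Mortality_national_comparison_footnote string,
                Safety_of_care_national_comparison string,
                Safety_of_care_national_comparison_footnote string,
                Readmission_national_comparison string,
                Readmission_national_comparison_footnote string,
                Patient_experience_national_comparison string,
                Patient_experience_national_comparison_footnote string,
                Effectiveness_of_care_national_comparison string,
                Effectiveness_of_care_national_comparison_footnote string,
                Timeliness_of_care_national_comparison string,
                Timeliness_of_care_national_comparison_footnote string,
                Efficient_use_of_medical_imaging_national_comparison string,
                Efficient_use_of_medical_imaging_national_comparison_footnote string,
                Location string
        FROM @fileB
        USING Extractors.Tsv();

    // Join on ZIP_Code and keep only the requested zip code
    @output =
        SELECT b.Provider_ID,
               b.Hospital_Name,
               b.Address,
               b.City,
               b.State,
               b.ZIP_Code,
               a.Complaint_ID
        FROM @inputA AS a
             INNER JOIN
                 @inputB AS b
             ON a.ZIP_Code == b.ZIP_Code
        WHERE a.ZIP_Code == @zipCode;

    // Write the joined rows to the caller-supplied path
    OUTPUT @output
    TO @outputFile
    USING Outputters.Tsv(quoting : false);
END;
```

Under these assumptions, a separate script (for example the one run by a Data Factory U-SQL activity) would call it as `master.dbo.JoinComplaintsToHospitals("/input/input48A.txt", "/input/input48B.txt", "/output/output.txt", "36033");`, which reproduces the hard-coded values from the original script.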
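There are of course many ways to achieve this, each with its own pros and cons. For example, a custom .NET activity might feel more comfortable to someone with a .NET background, but you need some compute to run it. Importing the files into an Azure SQL Database would be a good option for someone with a SQL / database background who has an Azure SQL Database in their subscription.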