Copying and extracting zipped XML files from an HTTP link source to Azure Blob Storage using Azure Data Factory

Question

I am trying to set up an Azure Data Factory copy data pipeline. The source is an open HTTP link (URL reference: https://clinicaltrials.gov/AllPublicXML.zip). The source is a zipped folder containing many XML files. I want to unzip it and save the extracted XML files in Azure Blob Storage using Azure Data Factory. I was trying to follow the configuration described in "How to decompress a zip file in Azure Data Factory v2", but I get the following error:

ErrorCode=UserErrorSourceNotSeekable,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Your HttpServer source can't support random read which is requied by current copy activity setting, please create two copy activities to work around it: the first copy activity binary copy your HttpServer source to a staging file store(like Azure Blob, Azure Data Lake, File, etc.), second copy activity copy from the staged file store to your destination with current settings.,Source=Microsoft.DataTransfer.ClientLibrary,'

I am not sure what went wrong, but it would be very helpful if someone could guide me through the process.

Answer 1

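I would break this into two Copy Data activities, to separate the download of the zip file (which is quite large) from the decompression. This is also the workaround the error message itself suggests. You could try to do it in one step, but I think you will run into timeout issues. With my approach you also keep a copy of the original zip file, which helps with audit trail and debugging.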
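To illustrate why the staging step matters, here is a small local sketch in Python (not ADF itself). A likely reason for the "random read" requirement is that a ZIP archive keeps its central directory at the end of the file, so a reader has to seek. The ForwardOnlyStream class below is a hypothetical stand-in for an HTTP response body, and the in-memory buffer plays the role of the staging blob.

```python
import io
import zipfile


class ForwardOnlyStream(io.RawIOBase):
    """Hypothetical stand-in for an HTTP body: readable but not seekable."""

    def __init__(self, data: bytes):
        super().__init__()
        self._buf = io.BytesIO(data)

    def readable(self) -> bool:
        return True

    def seekable(self) -> bool:
        return False

    def readinto(self, b) -> int:
        chunk = self._buf.read(len(b))
        b[:len(chunk)] = chunk
        return len(chunk)  # 0 signals end of stream


def make_zip(files: dict) -> bytes:
    """Build a small zip archive in memory for the demonstration."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, text in files.items():
            zf.writestr(name, text)
    return buf.getvalue()


def can_unzip_directly(stream) -> bool:
    """Try to open the archive straight from the stream, without staging."""
    try:
        with zipfile.ZipFile(stream) as zf:
            zf.namelist()
        return True
    except (zipfile.BadZipFile, OSError, ValueError):
        return False


def stage_then_unzip(stream) -> dict:
    """Step 1: plain binary copy into a seekable store. Step 2: unzip from there."""
    staged = io.BytesIO(stream.read())
    with zipfile.ZipFile(staged) as zf:
        return {name: zf.read(name) for name in zf.namelist()}


payload = make_zip({"a.xml": "<a/>", "b.xml": "<b/>"})
print(can_unzip_directly(ForwardOnlyStream(payload)))
print(sorted(stage_then_unzip(ForwardOnlyStream(payload))))
```

The direct attempt fails because the reader cannot seek, while the staged copy unzips normally, which mirrors the two-activity split.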

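I have tried to document my ADF pattern in a boxes-and-lines format that shows the key details of each component. There are two copy activities here, plus the supporting linked services and datasets. Try to follow this and let me know how you get on: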
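As a rough text rendering of the same pattern, the two activities and their datasets might look like the following pipeline definition, built here as Python dictionaries and printed as JSON. This is a sketch under assumptions: every name (linked services, datasets, containers, activities) is hypothetical, and the property names follow the ADF Binary dataset and Copy activity JSON as I understand it, so check them against the official documentation before use.

```python
import json

# Hypothetical linked service names (assumptions, not from the original answer).
HTTP_LS = "ls_http_clinicaltrials"  # HTTP linked service, base URL https://clinicaltrials.gov/
BLOB_LS = "ls_blob_storage"         # Azure Blob Storage linked service


def binary_dataset(name, linked_service, location, compression=None):
    """Build a Binary dataset definition; compression is optional."""
    type_props = {"location": location}
    if compression:
        type_props["compression"] = {"type": compression}
    return {"name": name, "properties": {
        "type": "Binary",
        "linkedServiceName": {"referenceName": linked_service,
                              "type": "LinkedServiceReference"},
        "typeProperties": type_props}}


def blob_location(container, file_name=None):
    loc = {"type": "AzureBlobStorageLocation", "container": container}
    if file_name:
        loc["fileName"] = file_name
    return loc


def ref(dataset):
    return {"referenceName": dataset["name"], "type": "DatasetReference"}


# Datasets: the zip over HTTP, the staged zip (raw and as ZipDeflate), the XML output.
ds_http = binary_dataset("ds_http_zip", HTTP_LS,
                         {"type": "HttpServerLocation", "relativeUrl": "AllPublicXML.zip"})
ds_staged_raw = binary_dataset("ds_blob_zip_raw", BLOB_LS,
                               blob_location("staging", "AllPublicXML.zip"))
ds_staged_zip = binary_dataset("ds_blob_zip_compressed", BLOB_LS,
                               blob_location("staging", "AllPublicXML.zip"), "ZipDeflate")
ds_xml_out = binary_dataset("ds_blob_xml_out", BLOB_LS, blob_location("xml"))

# Activity 1: plain binary copy of the zip from HTTP to the staging container.
download = {
    "name": "DownloadZip", "type": "Copy",
    "inputs": [ref(ds_http)], "outputs": [ref(ds_staged_raw)],
    "typeProperties": {
        "source": {"type": "BinarySource",
                   "storeSettings": {"type": "HttpReadSettings", "requestMethod": "GET"}},
        "sink": {"type": "BinarySink",
                 "storeSettings": {"type": "AzureBlobStorageWriteSettings"}}}}

# Activity 2: read the staged zip as ZipDeflate and write the extracted files.
unzip = {
    "name": "UnzipToBlob", "type": "Copy",
    "dependsOn": [{"activity": download["name"], "dependencyConditions": ["Succeeded"]}],
    "inputs": [ref(ds_staged_zip)], "outputs": [ref(ds_xml_out)],
    "typeProperties": {
        "source": {"type": "BinarySource",
                   "storeSettings": {"type": "AzureBlobStorageReadSettings"},
                   "formatSettings": {"type": "BinaryReadSettings",
                                      "compressionProperties": {
                                          "type": "ZipDeflateReadSettings",
                                          "preserveZipFileNameAsFolder": False}}},
        "sink": {"type": "BinarySink",
                 "storeSettings": {"type": "AzureBlobStorageWriteSettings"}}}}

pipeline = {"name": "pl_download_and_unzip",
            "properties": {"activities": [download, unzip]}}
print(json.dumps(pipeline, indent=2))
```

The key design point is that the same staged blob is described by two datasets: one without compression for the download sink, and one with ZipDeflate compression for the unzip source.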

Note that ADF took quite a long time to unzip the .xml files, as there are a lot of them. My results, as shown in Azure Storage Explorer:

azure

azure-data-factory

azure-data-lake

data-pipeline

azure-data-factory-2