Check if all files are available in storage - Azure ADF

Question

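In Azure Data Factory, how do I check whether an array of strings (file names) contains a value?

I am getting the file names from a Get Metadata activity, and I need to check that all 4 of the file names I have are available in the storage account before proceeding.

I expect 4 files in the storage account, and I need to check that all 4 of them are available. I need to check the file names explicitly rather than the file count; this is a requirement.

When I try to validate this using childItems from Get Metadata, I get the error "array elements can only be selected using an integer index." The problem is that on the next load a file may appear at any index.

Is there a better way to validate the file names?

Thanks in advance for your help.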

Answer 1

My Get Metadata output looks like this:

"childItems": [
    {
        "name": "1.py",
        "type": "File"
    },
    {
        "name": "SalesData.numbers",
        "type": "File"
    },
    {
        "name": "file1.txt",
        "type": "File"
    }
]

I used the following expression in a Set Variable activity to check the file names:

@if(
contains(activity('Get Metadata1').output.childitems,
json(concat('{"name":"file1.txt"',',','"type":"File"}'))), 

if(
contains(activity('Get Metadata1').output.childitems,
json(concat('{"name":"file2.txt"',',','"type":"File"}'))),

if(
contains(activity('Get Metadata1').output.childitems,
json(concat('{"name":"2.py"',',','"type":"File"}'))),'yes','no')
,'no')
,'no')

This checks whether my blob container has file1.txt, file2.txt and 2.py.

If it does, I assign yes to the variable, otherwise no.
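
For readers who want to reason about this logic outside the pipeline, here is a rough Python sketch of the same check. It is only an illustration, not ADF code: `all_files_present` and `required_names` are hypothetical names, and the sample data is the Get Metadata output shown above.

```python
# Sample Get Metadata output, copied from the answer above.
child_items = [
    {"name": "1.py", "type": "File"},
    {"name": "SalesData.numbers", "type": "File"},
    {"name": "file1.txt", "type": "File"},
]

# File names the ADF expression looks for.
required_names = ["file1.txt", "file2.txt", "2.py"]


def all_files_present(items, names):
    """Return 'yes' only if every name appears in items as a File entry."""
    for name in names:
        # Whole-object comparison, mirroring contains() on childItems.
        if {"name": name, "type": "File"} not in items:
            return "no"
    return "yes"


result = all_files_present(child_items, required_names)
print(result)  # prints "no": file2.txt and 2.py are not in the sample output
```

As in the expression, the comparison is against the whole object (name and type), so an entry with the right name but type "Folder" would not count.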

You can also use an If Condition activity.

Answer 2

Could you try this (Python)?

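import fnmatch
import os

rootPath = '/'      # directory to start searching from
pattern = '*.mp3'   # file name pattern to match

# Walk the directory tree and print the full path of every matching file
for root, dirs, files in os.walk(rootPath):
    for filename in fnmatch.filter(files, pattern):
        print(os.path.join(root, filename))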

Answer 3

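It is possible to check for the existence of multiple files using arrays, but it is a bit fiddly. I often hand this off to another activity in the pipeline, such as a Stored Procedure or Notebook activity, depending on what compute is available in the pipeline (e.g. a SQL database or a Spark cluster). However, if you really need to do this in the pipeline itself, this may work for you.

First, I have an array parameter with the following value: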

Parameter Name	Parameter Type	Parameter Value
pFilesToCheck	Array	["json1.json","json2.json","json3.json","json4.json"]

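These are the files that must exist. Next I have a Get Metadata activity pointing at a data lake folder, with the Child Items argument set in the Field list.

This returns output in the following format, listing all the files in the given directory along with some additional information about the execution: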

{
    "childItems": [
        {
            "name": "json1.json",
            "type": "File"
        },
        {
            "name": "json2.json",
            "type": "File"
        },
        {
            "name": "json3.json",
            "type": "File"
        },
        {
            "name": "json4.json",
            "type": "File"
        }
    ],
    "effectiveIntegrationRuntime": "AutoResolveIntegrationRuntime (Some Region)",
    "executionDuration": 0,
    "durationInQueue": {
        "integrationRuntimeQueue": 1
    },
    "billingReference": {
        "activityType": "PipelineActivity",
        "billableDuration": [
            {
                "meterType": "AzureIR",
                "duration": 0.016666666666666666,
                "unit": "Hours"
            }
        ]
    }
}

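To compare the input array pFilesToCheck (the files that must exist) with the results of the Get Metadata activity (the files that do exist), we have to put them in a comparable format. I use an array variable to do this: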

Variable Name	Variable Type
arrFilenames	Array

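Next is a For Each activity running in Sequential mode, using the range function to loop from 0 to 3, i.e. the array index of each item in the childItems array. The expression takes the number of items in the Get Metadata output as the count, and the indexes are zero-based. The Items property is set to the following expression: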

@range(0,length(activity('Get Metadata File List').output.childItems))

Inside the For Each activity is an Append Variable activity, which appends the current item of the for-each loop to the array variable arrFilenames. It uses this expression in the Value property:

@activity('Get Metadata File List').output.childItems[item()].name

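'@item()' in this case will be a number between 0 and 3 generated by the range function above. Once the loop has finished, the array arrFilenames looks like this (i.e. the same format as the input array):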

["json1.json","json2.json","json3.json","json4.json"]
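
The loop-and-append step can be sketched in Python as follows. This is only an analogy for the For Each and Append Variable activities, not something that runs in ADF; the variable names are illustrative and the data is the sample Get Metadata output shown earlier.

```python
# Sample childItems from the Get Metadata output shown earlier.
child_items = [
    {"name": "json1.json", "type": "File"},
    {"name": "json2.json", "type": "File"},
    {"name": "json3.json", "type": "File"},
    {"name": "json4.json", "type": "File"},
]

# Plays the role of the array variable arrFilenames.
arr_filenames = []

# Generate the indexes 0..3, like @range(0, length(childItems)) does.
for index in range(0, len(child_items)):
    # Like childItems[item()].name: integer index first, then the property.
    arr_filenames.append(child_items[index]["name"])

print(arr_filenames)
# ['json1.json', 'json2.json', 'json3.json', 'json4.json']
```

Indexing with an integer first and only then reading the name property is what avoids the "array elements can only be selected using an integer index" error from the question.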

Now the input array and the actual file list can be compared using the intersection function. I use a Set Variable activity with a boolean variable to record the result:

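@equals(
length(pipeline().parameters.pFilesToCheck),
length(intersection(variables('arrFilenames'),pipeline().parameters.pFilesToCheck)))

This expression compares the length of the array of files that must exist with the length of that same array intersected with the array of files that actually exist. If the numbers match, every required file is present. If they do not match, at least one required file is missing. Extra files in the folder do not affect the result.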
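
The intersection idea can also be tried out in plain Python, for example in a Notebook activity. The sketch below is an assumption-laden illustration rather than pipeline code: `all_required_present` is a hypothetical helper, and it compares the size of the intersection with the size of the required list, so a missing file is reported while extra files in the folder are ignored.

```python
def all_required_present(required, actual):
    """Return True if every required file name appears in the actual list."""
    required_set = set(required)
    # Names that are both required and actually present.
    found = required_set & set(actual)
    return len(found) == len(required_set)


required = ["json1.json", "json2.json", "json3.json", "json4.json"]

# All four files exist.
print(all_required_present(required, ["json1.json", "json2.json", "json3.json", "json4.json"]))  # True

# json4.json is missing.
print(all_required_present(required, ["json1.json", "json2.json", "json3.json"]))  # False

# An unrelated extra file does not change the outcome.
print(all_required_present(required, required + ["extra.csv"]))  # True
```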
