.NET for Apache Spark authentication against ADLS (Azure Data Lake Store) Gen1

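I am new to Apache Spark. I am trying to read data from ADLS using Microsoft's Apache Spark NuGet library, but I can't figure out how to authenticate with Spark. There doesn't seem to be any documentation on this. Is it possible? I am writing a .NET Framework console application.

Any help/pointers would be greatly appreciated!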

If you want to use Azure Data Lake Store in Spark, follow the steps below. Note that I tested with Spark 3.0.1 and Hadoop 3.2.

  1. Create a service principal
az login
az ad sp create-for-rbac --name "myApp" --role contributor --scopes /subscriptions/<subscription-id>/resourceGroups/<group-name> --sdk-auth
  2. Grant the service principal access to the data lake
Connect-AzAccount
# Look up the service principal's object ID from its client (application) ID
$sp=Get-AzADServicePrincipal -ApplicationId 42e0d080-b1f3-40cf-8db6-c4c522d988c4

$fullAcl="user:$($sp.Id):rwx,default:user:$($sp.Id):rwx"
$newFullAcl = $fullAcl.Split("{,}")
Set-AdlStoreItemAclEntry -Account <> -Path / -Acl $newFullAcl -Recurse -Debug
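
The ACL spec built above contains two entries for the service principal: an access entry, and a matching `default:` entry that new child files and folders inherit. Purely to illustrate that format, here is a small C# sketch. The `AclSpec` class and `EntriesFor` method are names I made up for this example, and the code only builds strings; it does not call Azure.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AclSpec
{
    // Builds the access entry and the matching default entry for one
    // service principal, in the "[default:]user:<object id>:<permissions>" form.
    // Hypothetical helper for illustration only.
    public static string[] EntriesFor(string objectId, string permissions)
    {
        if (string.IsNullOrWhiteSpace(objectId))
            throw new ArgumentException("An object ID is required.", nameof(objectId));

        // Permissions are exactly three characters: r or -, w or -, x or -.
        if (permissions == null || !Regex.IsMatch(permissions, @"^[r-][w-][x-]\z"))
            throw new ArgumentException("Expected a value such as rwx or r-x.", nameof(permissions));

        string access = "user:" + objectId + ":" + permissions;
        return new[] { access, "default:" + access };
    }
}
```

Joining the two returned entries with a comma gives the same shape as `$fullAcl` in the PowerShell above.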
  3. Code
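// Namespaces used by this snippet (place at the top of the file)
using System.Collections.Generic;
using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Types;

// The statements below go in the body of Main
string filePath =
                "adl://<account name>.azuredatalakestore.net/parquet/people.parquet";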

            // Create SparkSession
            SparkSession spark = SparkSession
                .Builder()
                .AppName("Azure Data Lake Storage example using .NET for Apache Spark")
                .Config("fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem")
                .Config("fs.adl.oauth2.access.token.provider.type", "ClientCredential")
                .Config("fs.adl.oauth2.client.id", "<sp appid>")
                .Config("fs.adl.oauth2.credential", "<sp password>")
                .Config("fs.adl.oauth2.refresh.url", "https://login.microsoftonline.com/<tenant>/oauth2/token")
                .GetOrCreate();

            // Create sample data
            var data = new List<GenericRow>
            {
                new GenericRow(new object[] { 1, "John Doe"}),
                new GenericRow(new object[] { 2, "Jane Doe"}),
                new GenericRow(new object[] { 3, "Foo Bar"})
            };

            // Create schema for sample data
            var schema = new StructType(new List<StructField>()
            {
                new StructField("Id", new IntegerType()),
                new StructField("Name", new StringType()),
            });

            // Create DataFrame using data and schema
            DataFrame df = spark.CreateDataFrame(data, schema);

            // Print DataFrame
            df.Show();

            // Write DataFrame to Azure Data Lake Gen1
            df.Write().Mode(SaveMode.Overwrite).Parquet(filePath);

            // Read saved DataFrame from Azure Data Lake Gen1
            DataFrame readDf = spark.Read().Parquet(filePath);

            // Print DataFrame
            readDf.Show();

            // Stop Spark session
            spark.Stop();
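
If you would rather not hard-code the service principal secret, one option is to build the `fs.adl.*` settings in a small helper and read the values from the environment. This is only a sketch under some assumptions: the `AdlsGen1Settings` class and the environment variable names `ADLS_TENANT_ID`, `ADLS_CLIENT_ID` and `ADLS_CLIENT_SECRET` are made up for this example. The setting keys are the same ones used in the code above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; uses only the base class library.
//
// Possible usage with the SparkSession builder (requires Microsoft.Spark):
//   var builder = SparkSession.Builder().AppName("my app");
//   foreach (var kv in AdlsGen1Settings.FromEnvironment())
//       builder = builder.Config(kv.Key, kv.Value);
//   SparkSession spark = builder.GetOrCreate();
public static class AdlsGen1Settings
{
    // Returns the Hadoop "fs.adl.*" settings for service principal
    // (ClientCredential) authentication, keyed by setting name.
    public static Dictionary<string, string> Build(
        string tenantId, string clientId, string clientSecret)
    {
        Require(tenantId, nameof(tenantId));
        Require(clientId, nameof(clientId));
        Require(clientSecret, nameof(clientSecret));

        return new Dictionary<string, string>
        {
            ["fs.adl.impl"] = "org.apache.hadoop.fs.adl.AdlFileSystem",
            ["fs.adl.oauth2.access.token.provider.type"] = "ClientCredential",
            ["fs.adl.oauth2.client.id"] = clientId,
            ["fs.adl.oauth2.credential"] = clientSecret,
            ["fs.adl.oauth2.refresh.url"] =
                "https://login.microsoftonline.com/" + tenantId + "/oauth2/token"
        };
    }

    // Reads the three values from environment variables.
    // The variable names are assumptions; use whatever fits your setup.
    public static Dictionary<string, string> FromEnvironment()
    {
        return Build(
            Environment.GetEnvironmentVariable("ADLS_TENANT_ID"),
            Environment.GetEnvironmentVariable("ADLS_CLIENT_ID"),
            Environment.GetEnvironmentVariable("ADLS_CLIENT_SECRET"));
    }

    // Builds an adl:// URI for a path inside the given Gen1 account.
    public static string PathFor(string accountName, string relativePath)
    {
        Require(accountName, nameof(accountName));
        string trimmed = (relativePath ?? string.Empty).TrimStart('/');
        return "adl://" + accountName + ".azuredatalakestore.net/" + trimmed;
    }

    // Throws if a required value is null, empty, or whitespace.
    private static void Require(string value, string name)
    {
        if (string.IsNullOrWhiteSpace(value))
            throw new ArgumentException("A value is required.", name);
    }
}
```

With this in place, a missing variable fails fast with an `ArgumentException` before Spark starts, rather than surfacing later as an authentication error from the ADLS connector.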
  4. Run
spark-submit ^
--packages org.apache.hadoop:hadoop-azure-datalake:3.2.0 ^
--class org.apache.spark.deploy.dotnet.DotnetRunner ^
--master local ^
microsoft-spark-3-0_2.12-<version>.jar ^
dotnet <application name>.dll

For details, see

https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-service-to-service-authenticate-using-active-directory

https://hadoop.apache.org/docs/current/hadoop-azure-datalake/index.html

https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-access-control