.NET for Apache Spark authentication against ADLS (Azure Data Lake Store) Gen1
I'm new to Apache Spark. I'm trying to read data from ADLS using Microsoft's .NET for Apache Spark NuGet library, but I can't figure out how to authenticate with Spark. There seems to be no documentation on this at all. Is this possible?
I'm writing a .NET Framework console application.
Any help/pointers would be appreciated!
If you want to access Azure Data Lake Storage Gen1 from Spark, follow the steps below. Note that I tested with Spark 3.0.1 and Hadoop 3.2.
- Create a service principal
az login
az ad sp create-for-rbac --name "myApp" --role contributor --scopes /subscriptions/<subscription-id>/resourceGroups/<group-name> --sdk-auth
- Grant the service principal access to the Data Lake
Connect-AzAccount
# Look up the service principal's object id from its client (application) id
$sp = Get-AzADServicePrincipal -ApplicationId 42e0d080-b1f3-40cf-8db6-c4c522d988c4
# Grant an access ACL and a default ACL, then split into the array the cmdlet expects
$fullAcl = "user:$($sp.Id):rwx,default:user:$($sp.Id):rwx"
$newFullAcl = $fullAcl.Split(",")
Set-AdlStoreItemAclEntry -Account <account name> -Path / -Acl $newFullAcl -Recurse
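If you prefer the Azure CLI over the Az PowerShell module, the same ACLs can be set with the `az dls fs access` commands; a sketch, assuming `<account name>` and `<sp object id>` are filled in with your values:

```shell
# Access ACL on the root for the service principal's object id
az dls fs access set-entry --account <account name> --path / \
    --acl-spec "user:<sp object id>:rwx"
# Default ACL so newly created children inherit the permission
az dls fs access set-entry --account <account name> --path / \
    --acl-spec "default:user:<sp object id>:rwx"
```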
- Code
using System.Collections.Generic;
using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Types;

string filePath =
    "adl://<account name>.azuredatalakestore.net/parquet/people.parquet";
// Create SparkSession
SparkSession spark = SparkSession
.Builder()
.AppName("Azure Data Lake Storage example using .NET for Apache Spark")
.Config("fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem")
.Config("fs.adl.oauth2.access.token.provider.type", "ClientCredential")
.Config("fs.adl.oauth2.client.id", "<sp appid>")
.Config("fs.adl.oauth2.credential", "<sp password>")
.Config("fs.adl.oauth2.refresh.url", $"https://login.microsoftonline.com/<tenant>/oauth2/token")
.GetOrCreate();
// Create sample data
var data = new List<GenericRow>
{
new GenericRow(new object[] { 1, "John Doe"}),
new GenericRow(new object[] { 2, "Jane Doe"}),
new GenericRow(new object[] { 3, "Foo Bar"})
};
// Create schema for sample data
var schema = new StructType(new List<StructField>()
{
new StructField("Id", new IntegerType()),
new StructField("Name", new StringType()),
});
// Create DataFrame using data and schema
DataFrame df = spark.CreateDataFrame(data, schema);
// Print DataFrame
df.Show();
// Write DataFrame to Azure Data Lake Gen1
df.Write().Mode(SaveMode.Overwrite).Parquet(filePath);
// Read saved DataFrame from Azure Data Lake Gen1
DataFrame readDf = spark.Read().Parquet(filePath);
// Print DataFrame
readDf.Show();
// Stop Spark session
spark.Stop();
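Hardcoding the service principal secret in source is fragile; a minimal sketch of loading the credentials at runtime instead (the environment variable names `ADLS_SP_APPID`, `ADLS_SP_SECRET`, and `ADLS_TENANT` are assumptions, not part of the original answer):

```csharp
using System;
using Microsoft.Spark.Sql;

static class SparkSessionFactory
{
    // Builds the same SparkSession as above, but reads the service
    // principal credentials from hypothetical environment variables.
    public static SparkSession Create()
    {
        string appId  = Environment.GetEnvironmentVariable("ADLS_SP_APPID");
        string secret = Environment.GetEnvironmentVariable("ADLS_SP_SECRET");
        string tenant = Environment.GetEnvironmentVariable("ADLS_TENANT");

        return SparkSession
            .Builder()
            .AppName("ADLS Gen1 example")
            .Config("fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem")
            .Config("fs.adl.oauth2.access.token.provider.type", "ClientCredential")
            .Config("fs.adl.oauth2.client.id", appId)
            .Config("fs.adl.oauth2.credential", secret)
            .Config("fs.adl.oauth2.refresh.url",
                "https://login.microsoftonline.com/" + tenant + "/oauth2/token")
            .GetOrCreate();
    }
}
```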
- Run
spark-submit ^
--packages org.apache.hadoop:hadoop-azure-datalake:3.2.0 ^
--class org.apache.spark.deploy.dotnet.DotnetRunner ^
--master local ^
microsoft-spark-3-0_2.12-<version>.jar ^
dotnet <application name>.dll
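The console project itself needs a reference to the Microsoft.Spark NuGet package; a minimal .csproj sketch (the version shown is illustrative, keep it in sync with the microsoft-spark worker JAR you pass to spark-submit):

```xml
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <!-- .NET Framework 4.8 console application -->
    <TargetFramework>net48</TargetFramework>
  </PropertyGroup>
  <ItemGroup>
    <!-- Illustrative version; match it to your microsoft-spark-3-0 JAR -->
    <PackageReference Include="Microsoft.Spark" Version="1.0.0" />
  </ItemGroup>
</Project>
```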
For details, see
https://hadoop.apache.org/docs/current/hadoop-azure-datalake/index.html
https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-access-control