Setting data lake connection in cluster Spark Config for Azure Databricks
I'm trying to simplify notebook creation for developers/data scientists in an Azure Databricks workspace that connects to an Azure Data Lake Gen2 account. Right now, every notebook has this at the top:
%scala
spark.sparkContext.hadoopConfiguration.set("fs.azure.account.auth.type.<datalake>.dfs.core.windows.net", "OAuth")
spark.sparkContext.hadoopConfiguration.set("fs.azure.account.oauth.provider.type.<datalake>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.sparkContext.hadoopConfiguration.set("fs.azure.account.oauth2.client.id.<datalake>.dfs.core.windows.net", <SP client id>)
spark.sparkContext.hadoopConfiguration.set("fs.azure.account.oauth2.client.secret.<datalake>.dfs.core.windows.net", dbutils.secrets.get("<scope>", "<SP client secret>"))
spark.sparkContext.hadoopConfiguration.set("fs.azure.account.oauth2.client.endpoint.<datalake>.dfs.core.windows.net", "https://login.microsoftonline.com/<tenant>/oauth2/token")
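One way to at least shrink that boilerplate is to factor the five settings into a helper that each notebook calls with its account-specific values. A minimal sketch in Python (the same `hadoopConfiguration` keys apply from PySpark); the function name and parameters here are my own, not a Databricks API:

```python
def adls_oauth_conf(account: str, client_id: str, client_secret: str, tenant_id: str) -> dict:
    """Build the fs.azure.* OAuth settings for one ADLS Gen2 account.

    Returns a dict of key -> value. In a notebook you would then apply each
    pair via spark.sparkContext.hadoopConfiguration.set(k, v) (Scala) or the
    equivalent PySpark call.
    """
    suffix = f"{account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{suffix}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{suffix}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{suffix}": client_id,
        f"fs.azure.account.oauth2.client.secret.{suffix}": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{suffix}":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }
```

This only centralizes the notebook-side code; it does not move the configuration to the cluster, which is the question below.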
Our implementation tries to avoid mounting in DBFS, so I've been trying to see whether I can use the cluster's Spark Config to define these values instead (each cluster can access a different data lake).
However, I haven't been able to get it to work. When I try various flavors of:
org.apache.hadoop.fs.azure.account.oauth2.client.id.<datalake>.dfs.core.windows.net <sp client id>
org.apache.hadoop.fs.azure.account.auth.type.<datalake>.dfs.core.windows.net OAuth
org.apache.hadoop.fs.azure.account.oauth.provider.type.<datalake>.dfs.core.windows.net "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider"
org.apache.hadoop.fs.azure.account.oauth2.client.secret.<datalake>.dfs.core.windows.net {{secrets/secret/secret}}
org.apache.hadoop.fs.azure.account.oauth2.client.endpoint.<datalake>.dfs.core.windows.net "https://login.microsoftonline.com/<tenant>"
I get "Failure to initialize configuration". The version above appears to fall back to the storage account access key rather than the SP credentials (I'm just testing with a simple ls command to make sure it works):
ExecutionError: An error occurred while calling z:com.databricks.backend.daemon.dbutils.FSUtils.ls.
: Failure to initialize configuration
at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.services.SimpleKeyProvider.getStorageAccountKey(SimpleKeyProvider.java:51)
at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AbfsConfiguration.getStorageAccountKey(AbfsConfiguration.java:412)
at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.initializeClient(AzureBlobFileSystemStore.java:1016)
I'm hoping there's a way to do this, though if the only answer is "you can't do that," that's certainly an acceptable answer.
If you want to add the Azure Data Lake Gen2 configuration to the Azure Databricks cluster's Spark Config, refer to the following settings:
spark.hadoop.fs.azure.account.oauth2.client.id.<datalake>.dfs.core.windows.net <sp client id>
spark.hadoop.fs.azure.account.auth.type.<datalake>.dfs.core.windows.net OAuth
spark.hadoop.fs.azure.account.oauth.provider.type.<datalake>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
spark.hadoop.fs.azure.account.oauth2.client.secret.<datalake>.dfs.core.windows.net {{secrets/secret/secret}}
spark.hadoop.fs.azure.account.oauth2.client.endpoint.<datalake>.dfs.core.windows.net https://login.microsoftonline.com/<tenant>/oauth2/token
You may need to configure the cluster as described in Access ADLS Gen2 directly.
Also note the required format for referencing secrets, per Read a secret:
The syntax of the Spark configuration property or environment variable path value must be {{secrets/<scope-name>/<secret-name>}}. The value must start with {{secrets/ and end with }}.
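That syntax rule can be checked mechanically before pasting the value into the cluster UI. A quick sketch of a validator for the {{secrets/&lt;scope-name&gt;/&lt;secret-name&gt;}} form (the regex is my own approximation of the documented rule, not something Databricks ships):

```python
import re

# {{secrets/<scope-name>/<secret-name>}} -- must start with {{secrets/ and end with }}
_SECRET_REF = re.compile(r"^\{\{secrets/[^/{}]+/[^/{}]+\}\}$")

def is_valid_secret_ref(value: str) -> bool:
    """True if value is a well-formed Spark-config secret reference."""
    return _SECRET_REF.match(value) is not None
```

A bare secret value, or a reference missing the scope segment, fails the check.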
So this line:
spark.hadoop.fs.azure.account.oauth2.client.secret.<storage-account-name>.dfs.core.windows.net <service-credential>
should look like this:
spark.hadoop.fs.azure.account.oauth2.client.secret.yourstorageaccountname.dfs.core.windows.net {{secrets/yoursecretscope/yoursecretname}}
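Since each cluster points at a different data lake, the whole Spark Config block above can be rendered per account. A hedged sketch (the helper and its parameter names are illustrative; the generated keys are the ones from the answer):

```python
def cluster_spark_conf(account: str, client_id: str,
                       secret_scope: str, secret_name: str,
                       tenant_id: str) -> str:
    """Render the cluster Spark Config lines ("key value", one per line),
    using the spark.hadoop. prefix and a {{secrets/...}} reference."""
    suffix = f"{account}.dfs.core.windows.net"
    pairs = [
        (f"spark.hadoop.fs.azure.account.auth.type.{suffix}", "OAuth"),
        (f"spark.hadoop.fs.azure.account.oauth.provider.type.{suffix}",
         "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider"),
        (f"spark.hadoop.fs.azure.account.oauth2.client.id.{suffix}", client_id),
        (f"spark.hadoop.fs.azure.account.oauth2.client.secret.{suffix}",
         f"{{{{secrets/{secret_scope}/{secret_name}}}}}"),
        (f"spark.hadoop.fs.azure.account.oauth2.client.endpoint.{suffix}",
         f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"),
    ]
    return "\n".join(f"{k} {v}" for k, v in pairs)
```

The output can be pasted directly into the cluster's Spark Config box (or pushed through the Clusters API if you automate cluster creation).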