How to add bigquery-connector to an existing cluster on dataproc
I just started using Dataproc to do machine learning on big data stored in BigQuery. When I try to run this code:
df = spark.read.format('bigquery').load('bigquery-public-data.samples.shakespeare')
I get this error:
java.lang.ClassNotFoundException: Failed to find data source: bigquery. Please find packages at http://spark.apache.org/third-party-projects.html
I found some similar tutorials in this git repo: https://github.com/GoogleCloudDataproc/spark-bigquery-connector
However, I don't know where to put these scripts in order to run them. Could you help me understand?
Thanks in advance.
When creating the cluster, I opened the GCP console and ran this command:
gcloud dataproc clusters create clusterName \
  --bucket bucketName \
  --region europe-west3 \
  --zone europe-west3-a \
  --master-machine-type n1-standard-16 \
  --master-boot-disk-type pd-ssd \
  --master-boot-disk-size 200 \
  --num-workers 2 \
  --worker-machine-type n1-highmem-16 \
  --worker-boot-disk-size 200 \
  --image-version 2.0-debian10 \
  --max-idle 3600s \
  --optional-components JUPYTER \
  --initialization-actions 'gs://goog-dataproc-initialization-actions-europe-west3/python/pip-install.sh','gs://goog-dataproc-initialization-actions-europe-west3/connectors/connectors.sh' \
  --metadata 'PIP_PACKAGES=pyspark==3.1.2 tensorflow keras elephas==3.0.0',spark-bigquery-connector-version=0.21.0,bigquery-connector-version=1.2.0 \
  --project projectName \
  --enable-component-gateway
The --initialization-actions part of the command is what made it work for me:
--initialization-actions 'gs://goog-dataproc-initialization-actions-europe-west3/python/pip-install.sh','gs://goog-dataproc-initialization-actions-europe-west3/connectors/connectors.sh' \
--metadata 'PIP_PACKAGES=pyspark==3.1.2 tensorflow keras elephas==3.0.0',spark-bigquery-connector-version=0.21.0,bigquery-connector-version=1.2.0
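Since the question asks about an *existing* cluster (where initialization actions can no longer be applied), another option is to pass the connector jar at job-submission time. A sketch, assuming placeholder cluster/region/script names; `gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar` is the public jar location given in the connector's README:

```shell
# Attach the published connector jar when submitting a PySpark job to an
# already-running cluster. clusterName, europe-west3, and my_job.py are
# placeholders for your own values.
gcloud dataproc jobs submit pyspark my_job.py \
  --cluster=clusterName \
  --region=europe-west3 \
  --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar
```

With `--jars`, the connector is available only for that job, so no cluster re-creation is needed.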