Spark-BigTable - HBase client not closed in Pyspark?
I'm trying to execute a Pyspark statement that writes to BigTable within a Python for loop, which leads to the error below (job submitted with Dataproc). Is some client not being closed properly (as suggested here), and if so, is there any way to close it from Pyspark?
Note that manually re-executing the script each time with a new Dataproc job works fine, so the job itself is correct.
Thanks in advance for your support!
Pyspark script
from pyspark import SparkContext
from pyspark.sql import SQLContext
import json

sc = SparkContext()
sqlc = SQLContext(sc)

def create_df(n_start, n_stop):
    # Data
    row_1 = ['a'] + ['{}'.format(i) for i in range(n_start, n_stop)]
    row_2 = ['b'] + ['{}'.format(i) for i in range(n_start, n_stop)]
    # Spark schema
    ls = [row_1, row_2]
    schema = ['col0'] + ['col{}'.format(i) for i in range(n_start, n_stop)]
    # Catalog
    first_col = {"col0": {"cf": "rowkey", "col": "key", "type": "string"}}
    other_cols = {"col{}".format(i): {"cf": "cf", "col": "col{}".format(i), "type": "string"}
                  for i in range(n_start, n_stop)}
    first_col.update(other_cols)
    columns = first_col
    d_catalogue = {}
    d_catalogue["table"] = {"namespace": "default", "name": "testtable"}
    d_catalogue["rowkey"] = "key"
    d_catalogue["columns"] = columns
    catalog = json.dumps(d_catalogue)
    # Dataframe
    df = sc.parallelize(ls, numSlices=1000).toDF(schema=schema)
    return df, catalog

for i in range(0, 2):
    N_step = 100
    N_start = 1
    N_stop = N_start + N_step
    data_source_format = "org.apache.spark.sql.execution.datasources.hbase"
    df, catalog = create_df(N_start, N_stop)
    df.write \
        .options(catalog=catalog, newTable="5") \
        .format(data_source_format) \
        .save()
    N_start += N_step
    N_stop += N_step
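For context, the catalog passed to .options(catalog=...) is just a JSON description of the table name, row key and column mapping. The standalone sketch below (not part of the job above; the tiny column range is arbitrary and only for readability) rebuilds it the same way create_df does, so the structure handed to the data source is visible:

# Illustrative only: rebuild the catalog that create_df() produces, for a tiny
# column range, to show the JSON handed to the SHC data source.
import json

n_start, n_stop = 1, 3  # small range purely for readability

columns = {"col0": {"cf": "rowkey", "col": "key", "type": "string"}}
columns.update({
    "col{}".format(i): {"cf": "cf", "col": "col{}".format(i), "type": "string"}
    for i in range(n_start, n_stop)
})

catalog = json.dumps({
    "table": {"namespace": "default", "name": "testtable"},
    "rowkey": "key",
    "columns": columns,
}, indent=2)

print(catalog)  # nested "table" / "rowkey" / "columns" mapping, one entry per column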
Dataproc job
gcloud dataproc jobs submit pyspark <my_script>.py \
--cluster $SPARK_CLUSTER \
--jars <path_to_jar>/bigtable-dataproc-spark-shc-assembly-0.1.jar \
--region=us-east1
Error
...
ERROR com.google.bigtable.repackaged.io.grpc.internal.ManagedChannelOrphanWrapper: *~*~*~ Channel ManagedChannelImpl{logId=41, target=bigtable.googleapis.com:443} was not shutdown properly!!! ~*~*~*
Make sure to call shutdown()/shutdownNow() and wait until awaitTermination() returns true.
...
If you are not using the latest version, try updating to it. It looks similar to this issue, which was fixed recently. I would imagine the error message still appearing, but the job now completing, means that the support team is still working on it and will hopefully fix it in the next release.