Trying to start Spark Thrift Server with the DataStax Cassandra Connector

I have started the Spark Thrift Server and connected to it with Beeline. When I query a table created in the Hive metastore, I get the error below.

Create the table:

create table meeting_details using org.apache.spark.sql.cassandra options (keyspace 'TravelData', table 'meeting_details')
select * from meeting_details

The following error occurs.

This is running on macOS.

org.apache.spark.sql.cassandra is not a valid Spark SQL Data Source.

0: jdbc:hive2://localhost:10000> select * from traveldata.employee_details;

Error: org.apache.hive.service.cli.HiveSQLException: Error running query: java.util.concurrent.ExecutionException: org.apache.spark.sql.AnalysisException: org.apache.spark.sql.cassandra is not a valid Spark SQL Data Source.
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:361)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$$anon.$anonfun$run(SparkExecuteStatementOperation.scala:263)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:78)
    at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties$(SparkOperation.scala:62)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:43)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$$anon.run(SparkExecuteStatementOperation.scala:263)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$$anon.run(SparkExecuteStatementOperation.scala:258)
    at java.base/java.security.AccessController.doPrivileged(Native Method)
    at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon.run(SparkExecuteStatementOperation.scala:272)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.util.concurrent.ExecutionException: org.apache.spark.sql.AnalysisException: org.apache.spark.sql.cassandra is not a valid Spark SQL Data Source.
    at org.sparkproject.guava.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
    at org.sparkproject.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
    at org.sparkproject.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
    at org.sparkproject.guava.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
    at org.sparkproject.guava.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2410)
    at org.sparkproject.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2380)
    at org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
    at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
    at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4000)
    at org.sparkproject.guava.cache.LocalCache$LocalManualCache.get(LocalCache.java:4789)
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.getCachedPlan(SessionCatalog.scala:155)
    at org.apache.spark.sql.execution.datasources.FindDataSourceTable.org$apache$spark$sql$execution$datasources$FindDataSourceTable$$readDataSourceTable(DataSourceStrategy.scala:249)
    at org.apache.spark.sql.execution.datasources.FindDataSourceTable$$anonfun$apply.applyOrElse(DataSourceStrategy.scala:288)
    at org.apache.spark.sql.execution.datasources.FindDataSourceTable$$anonfun$apply.applyOrElse(DataSourceStrategy.scala:278)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown(AnalysisHelper.scala:108)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown(AnalysisHelper.scala:108)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:221)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown(AnalysisHelper.scala:106)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown$(AnalysisHelper.scala:104)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown(AnalysisHelper.scala:113)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren(TreeNode.scala:407)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:243)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:405)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:358)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown(AnalysisHelper.scala:113)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:221)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown(AnalysisHelper.scala:106)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown$(AnalysisHelper.scala:104)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown(AnalysisHelper.scala:113)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren(TreeNode.scala:407)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:243)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:405)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:358)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown(AnalysisHelper.scala:113)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:221)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown(AnalysisHelper.scala:106)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown$(AnalysisHelper.scala:104)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators(AnalysisHelper.scala:73)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators$(AnalysisHelper.scala:72)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:29)
    at org.apache.spark.sql.execution.datasources.FindDataSourceTable.apply(DataSourceStrategy.scala:278)
    at org.apache.spark.sql.execution.datasources.FindDataSourceTable.apply(DataSourceStrategy.scala:243)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute(RuleExecutor.scala:216)
    at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
    at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
    at scala.collection.immutable.List.foldLeft(List.scala:89)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute(RuleExecutor.scala:213)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$adapted(RuleExecutor.scala:205)
    at scala.collection.immutable.List.foreach(List.scala:392)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:205)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:196)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:190)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:155)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack(RuleExecutor.scala:183)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:183)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck(Analyzer.scala:174)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:228)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:173)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed(QueryExecution.scala:73)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase(QueryExecution.scala:143)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
    at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:143)
    at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:73)
    at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:71)
    at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:63)
    at org.apache.spark.sql.Dataset$.$anonfun$ofRows(Dataset.scala:98)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:96)
    at org.apache.spark.sql.SparkSession.$anonfun$sql(SparkSession.scala:615)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
    at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:610)
    at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:650)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:325)
    ... 16 more
Caused by: org.apache.spark.sql.AnalysisException: org.apache.spark.sql.cassandra is not a valid Spark SQL Data Source.
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:431)
    at org.apache.spark.sql.execution.datasources.FindDataSourceTable.$anonfun$readDataSourceTable(DataSourceStrategy.scala:261)
    at org.sparkproject.guava.cache.LocalCache$LocalManualCache.load(LocalCache.java:4792)
    at org.sparkproject.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
    at org.sparkproject.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
    ... 89 more (state=,code=0)
0: jdbc:hive2://localhost:10000> Closing: 0: jdbc:hive2://localhost:10000

You need to start the Thrift server the same way you would start spark-shell/pyspark/spark-submit: specify the connector package and all the other properties (see the quickstart docs):

sbin/start-thriftserver.sh \
  --packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.1 \
  --conf spark.cassandra.connection.host=127.0.0.1 \
  --conf spark.sql.extensions=com.datastax.spark.connector.CassandraSparkExtensions \
  --conf spark.sql.catalog.mycatalog=com.datastax.spark.connector.datasource.CassandraCatalog
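For reference, each option here matters: --packages pulls the connector (here version 3.0.1, built for Scala 2.12) from Maven Central, spark.cassandra.connection.host points at a Cassandra contact point, CassandraSparkExtensions enables the connector's Catalyst extensions, and the catalog entry exposes your Cassandra keyspaces under the catalog name mycatalog (the name itself is arbitrary).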

Then use:

>bin/beeline
Beeline version 2.3.7 by Apache Hive
beeline> !connect   jdbc:hive2://localhost:10000
Connecting to jdbc:hive2://localhost:10000
Enter username for jdbc:hive2://localhost:10000:
Enter password for jdbc:hive2://localhost:10000:
Connected to: Spark SQL (version 3.0.1)
Driver: Hive JDBC (version 2.3.7)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10000> SHOW NAMESPACES FROM mycatalog;
+------------+
| namespace  |
+------------+
| test       |
| zep        |
+------------+
2 rows selected (3,072 seconds)
0: jdbc:hive2://localhost:10000> SHOW TABLES FROM mycatalog.test;
+------------+-------------+
| namespace  |  tableName  |
+------------+-------------+
| test       | jtest1      |
| test       | roadworks5  |
| test       | zep1        |
+------------+-------------+
3 rows selected (0,139 seconds)
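With the catalog registered, Cassandra tables can also be queried directly through three-part identifiers, with no CREATE TABLE in the metastore needed at all. A minimal sketch, reusing the keyspace and table names from the listings above:

SELECT * FROM mycatalog.test.jtest1;

-- The metastore route from the question should also resolve now that
-- the connector is on the server's classpath:
CREATE TABLE meeting_details
USING org.apache.spark.sql.cassandra
OPTIONS (keyspace 'TravelData', table 'meeting_details');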