SnappyData : java.lang.OutOfMemoryError: GC overhead limit exceeded

I have 1.2 GB of ORC data on S3, and I am trying to do the following with the same data:

1) Cache the data on a SnappyData cluster [snappydata 0.9]

2) Run a group-by query against the cached dataset

3) Compare the performance with Spark 2.0.0

I am using a 64 GB / 8-core machine, and the Snappy cluster is configured as follows:

$ cat locators
localhost

$ cat leads
localhost -heap-size=4096m -spark.executor.cores=1

$ cat servers
localhost -heap-size=6144m
localhost -heap-size=6144m
localhost -heap-size=6144m
localhost -heap-size=6144m
localhost -heap-size=6144m
localhost -heap-size=6144m

Now, I wrote a small Python script to cache the ORC data from S3 and run a simple group-by query, as follows:

from pyspark.sql.snappy import SnappyContext
from pyspark import SparkContext,SparkConf
conf = SparkConf().setAppName('snappy_sample')
sc = SparkContext(conf=conf)
sqlContext = SnappyContext(sc)

sqlContext.sql("CREATE EXTERNAL TABLE if not exists my_schema.my_table using orc options(path 's3a://access_key:secret_key@bucket_name/path')")
sqlContext.cacheTable("my_schema.my_table")

out = sqlContext.sql("select *  from my_schema.my_table where (WeekId = '1') order by sum_viewcount desc limit 25")
out.collect()

The script above is executed with the following command:

spark-submit --master local[*] snappy_sample.py

I get the following error:

17/10/04 02:50:32 WARN memory.MemoryStore: Not enough space to cache rdd_2_5 in memory! (computed 21.2 MB so far)
17/10/04 02:50:32 WARN memory.MemoryStore: Not enough space to cache rdd_2_0 in memory! (computed 21.2 MB so far)
17/10/04 02:50:32 WARN storage.BlockManager: Persisting block rdd_2_5 to disk instead.
17/10/04 02:50:32 WARN storage.BlockManager: Persisting block rdd_2_0 to disk instead.
17/10/04 02:50:47 WARN storage.BlockManager: Putting block rdd_2_2 failed due to an exception
17/10/04 02:50:47 WARN storage.BlockManager: Block rdd_2_2 could not be removed as it was not found on disk or in memory
17/10/04 02:50:47 ERROR executor.Executor: Exception in task 2.0 in stage 0.0 (TID 2)
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
    at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
    at org.apache.spark.sql.execution.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:96)
    at org.apache.spark.sql.execution.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:97)
    at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$$anon$$anonfun$next.apply(InMemoryRelation.scala:135)
    at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$$anon$$anonfun$next.apply(InMemoryRelation.scala:134)
    at scala.collection.TraversableLike$$anonfun$map.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map.apply(TraversableLike.scala:234)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
    at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$$anon.next(InMemoryRelation.scala:134)
    at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$$anon.next(InMemoryRelation.scala:98)
    at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:232)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator.apply(BlockManager.scala:935)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator.apply(BlockManager.scala:926)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:331)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:320)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:284)
    at org.apache.spark.sql.execution.WholeStageCodegenRDD.compute(WholeStageCodegenExec.scala:496)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:320)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:284)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:320)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:284)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
17/10/04 02:50:47 ERROR util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-2,5,main]
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
    at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
    at org.apache.spark.sql.execution.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:96)
    at org.apache.spark.sql.execution.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:97)
    at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$$anon$$anonfun$next.apply(InMemoryRelation.scala:135)
    at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$$anon$$anonfun$next.apply(InMemoryRelation.scala:134)
    at scala.collection.TraversableLike$$anonfun$map.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map.apply(TraversableLike.scala:234)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
    at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$$anon.next(InMemoryRelation.scala:134)
    at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$$anon.next(InMemoryRelation.scala:98)
    at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:232)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator.apply(BlockManager.scala:935)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator.apply(BlockManager.scala:926)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:331)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:320)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:284)
    at org.apache.spark.sql.execution.WholeStageCodegenRDD.compute(WholeStageCodegenExec.scala:496)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:320)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:284)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:320)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:284)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
17/10/04 02:50:48 INFO snappystore: VM is exiting - shutting down distributed system

Apart from the error above, how do I check whether the data has been cached in the Snappy cluster?

1) First, it appears that you are not connecting to the SnappyData cluster with your Python script but rather running it in local mode. In that case the JVM launched by the Python script is failing with OOM as would be expected. When using Python, connect to the SnappyData cluster in "smart connector" mode:

spark-submit --master local[*] --conf snappydata.connection=locator:1527 snappy_sample.py

The host:port above is the locator host and the port on which its Thrift server is running (1527 by default).
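
Alternatively, a minimal sketch, assuming snappydata.connection is honored when supplied via SparkConf just as it is on the spark-submit command line (locator assumed on localhost:1527):

from pyspark import SparkContext, SparkConf
from pyspark.sql.snappy import SnappySession

# Sketch: set snappydata.connection in the SparkConf instead of passing
# --conf on the spark-submit command line (assumes locator on localhost:1527).
conf = (SparkConf()
        .setAppName('snappy_sample')
        .set("snappydata.connection", "localhost:1527"))
sc = SparkContext(conf=conf)
session = SnappySession(sc)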

2) Second, your example will use Spark caching. If you want to use SnappyData, load the data into a column table:

from pyspark.sql.snappy import SnappySession
from pyspark import SparkContext,SparkConf
conf = SparkConf().setAppName('snappy_sample')
sc = SparkContext(conf=conf)
session = SnappySession(sc)

session.sql("CREATE EXTERNAL TABLE if not exists my_table using orc options(path 's3a://access_key:secret_key@bucket_name/path')")
session.table("my_table").write.format("column").saveAsTable("my_column_table")

out = session.sql("select *  from my_column_table where (WeekId = '1') order by sum_viewcount desc limit 25")
out.collect()

Also note the use of "SnappySession" rather than the contexts, which have been deprecated since Spark 2.0.x. For the comparison against Spark caching, you can use "cacheTable" in a separate script run against upstream Spark, as sketched below. Note that "cacheTable" caches lazily, meaning the first query performs the actual caching, so the first query run with Spark caching will be very slow, while subsequent queries should be faster.
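
A minimal sketch of that separate comparison script for upstream Spark (the S3 path and the query come from the question above; the temp view name and the timing code are illustrative only):

import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('spark_cache_compare').getOrCreate()

# Register the ORC data as a temp view and mark it for caching (lazy).
spark.read.orc("s3a://access_key:secret_key@bucket_name/path") \
     .createOrReplaceTempView("my_table")
spark.catalog.cacheTable("my_table")

query = ("select * from my_table where (WeekId = '1') "
         "order by sum_viewcount desc limit 25")

start = time.time()
spark.sql(query).collect()   # first run performs the actual caching: slow
print("first run (includes caching): %.1fs" % (time.time() - start))

start = time.time()
spark.sql(query).collect()   # later runs read from the in-memory cache
print("cached run: %.1fs" % (time.time() - start))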

3) Update to the 1.0 release, which has many improvements, rather than using 0.9. You will also need to add hadoop-aws-2.7.3 and aws-java-sdk-1.7.4 to the "-classpath" in conf/leads and conf/servers (or drop them into the product's jars directory) before launching the cluster.
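
For example, a conf/servers entry might look like this (a sketch only; adjust the jar paths to wherever they live on your machines):

$ cat servers
localhost -heap-size=6144m -classpath=/path/to/hadoop-aws-2.7.3.jar:/path/to/aws-java-sdk-1.7.4.jar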