Task not serializable in Spark caused by UTFDataFormatException: encoded string too long

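I am having trouble running my Spark application on YARN. I have a very extensive set of integration tests that run without any issues, but when I run the application on YARN it throws the following error: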

17/01/06 11:22:23 ERROR yarn.ApplicationMaster: User class threw exception: org.apache.spark.SparkException: Task not serializable
org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2067)
    at org.apache.spark.rdd.RDD$$anonfun$map.apply(RDD.scala:324)
    at org.apache.spark.rdd.RDD$$anonfun$map.apply(RDD.scala:323)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.map(RDD.scala:323)
    at org.apache.spark.sql.DataFrame.map(DataFrame.scala:1410)
    at com.orgx.yy.dd.check.DQCheck$class.runDQCheck(DQCheck.scala:24)
    at com.orgx.yy.dd.check.DQBatchCheck.runDQCheck(DQBatchCheck.scala:13)
    at com.orgx.yy.dd.check.DQBatchCheck.doCheck(DQBatchCheck.scala:23)
    at com.orgx.yy.dd.DQChecker$.main(DQChecker.scala:60)
    at com.orgx.yy.dd.DQChecker.main(DQChecker.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon.run(ApplicationMaster.scala:542)
Caused by: java.io.UTFDataFormatException: encoded string too long: 72887 bytes
    at java.io.DataOutputStream.writeUTF(DataOutputStream.java:364)
    at java.io.DataOutputStream.writeUTF(DataOutputStream.java:323)
    at com.typesafe.config.impl.SerializedConfigValue.writeValueData(SerializedConfigValue.java:295)
    at com.typesafe.config.impl.SerializedConfigValue.writeValue(SerializedConfigValue.java:369)
    at com.typesafe.config.impl.SerializedConfigValue.writeValueData(SerializedConfigValue.java:309)
    at com.typesafe.config.impl.SerializedConfigValue.writeValue(SerializedConfigValue.java:369)
    at com.typesafe.config.impl.SerializedConfigValue.writeValueData(SerializedConfigValue.java:309)
    at com.typesafe.config.impl.SerializedConfigValue.writeValue(SerializedConfigValue.java:369)
    at com.typesafe.config.impl.SerializedConfigValue.writeValueData(SerializedConfigValue.java:309)
    at com.typesafe.config.impl.SerializedConfigValue.writeValue(SerializedConfigValue.java:369)
    at com.typesafe.config.impl.SerializedConfigValue.writeExternal(SerializedConfigValue.java:435)
    at java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
    at scala.collection.immutable.$colon$colon.writeObject(List.scala:379)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
    ... 20 more
17/01/06 11:22:24 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: org.apache.spark.SparkException: Task not serializable)

The culprit seems to be java.io.UTFDataFormatException: encoded string too long: 72887 bytes. Does anyone know why this happens?

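I managed to solve this. The problem was that I had introduced a Typesafe Config object into one of the classes used by the function that failed to serialize. As the SerializedConfigValue frames in the trace show, serializing that config goes through DataOutputStream.writeUTF, which cannot write a string whose encoded form is longer than 65535 bytes (the 64KB limit). Here the encoded string was 72887 bytes, so serializing the closure failed.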
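
To make the 64KB limit concrete, here is a minimal sketch that uses only the JDK. The class and helper names (WriteUtfLimitDemo, fitsInWriteUtf, repeat) are made up for illustration and are not part of Spark or Typesafe Config. It writes ASCII strings, which encode to one byte per character, so the string length equals the encoded length.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical demo class; only the java.io calls are real APIs.
class WriteUtfLimitDemo {

    // Returns true if DataOutputStream.writeUTF accepts the string,
    // false if it rejects it with UTFDataFormatException (encoded form too long).
    static boolean fitsInWriteUtf(String s) {
        try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
            out.writeUTF(s);
            return true;
        } catch (UTFDataFormatException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Builds a string of n copies of c (avoids String.repeat so it also compiles on Java 8).
    static String repeat(char c, int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    public static void main(String[] args) {
        // 65535 encoded bytes is the largest length writeUTF can store in its 2-byte length prefix.
        System.out.println(fitsInWriteUtf(repeat('a', 65535))); // prints true
        // Same size as in the error message above.
        System.out.println(fitsInWriteUtf(repeat('a', 72887))); // prints false
    }
}
```

The limit applies to a single string passed to writeUTF, not to the total size of the serialized object.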

Once I removed the config object from the class, it worked again.
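
For anyone who still needs values from the config inside the task, one option is to copy only the values you need into the serialized class instead of holding the whole config object. The sketch below is JDK-only and is not my actual code: BulkySettings is a hypothetical stand-in that, like the SerializedConfigValue frames in the trace, writes a string with writeUTF, and the two Check classes are invented for the comparison.

```java
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Arrays;

// Hypothetical demo; none of these classes exist in Spark or Typesafe Config.
class ConfigCaptureDemo {

    // Stand-in for a config object whose serialization writes a string with writeUTF.
    static class BulkySettings implements Externalizable {
        String longValue = "";
        int threshold;

        public BulkySettings() {}

        BulkySettings(String longValue, int threshold) {
            this.longValue = longValue;
            this.threshold = threshold;
        }

        @Override
        public void writeExternal(ObjectOutput out) throws IOException {
            out.writeUTF(longValue); // fails if the encoded string exceeds 65535 bytes
            out.writeInt(threshold);
        }

        @Override
        public void readExternal(ObjectInput in) throws IOException {
            longValue = in.readUTF();
            threshold = in.readInt();
        }
    }

    // Holds the whole settings object, so serializing the check serializes the settings too.
    static class CheckHoldingSettings implements Serializable {
        final BulkySettings settings;
        CheckHoldingSettings(BulkySettings settings) { this.settings = settings; }
    }

    // Holds only the value the check needs; the settings object never reaches the stream.
    static class CheckHoldingValue implements Serializable {
        final int threshold;
        CheckHoldingValue(BulkySettings settings) { this.threshold = settings.threshold; }
    }

    // Returns true if Java serialization of the object succeeds.
    static boolean canSerialize(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    // Builds a string of n copies of c.
    static String repeat(char c, int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    public static void main(String[] args) {
        BulkySettings settings = new BulkySettings(repeat('a', 72887), 10);
        System.out.println(canSerialize(new CheckHoldingSettings(settings)));
        System.out.println(canSerialize(new CheckHoldingValue(settings)));
    }
}
```

In a Spark job the same idea would mean reading the needed values from the config on the driver and letting the closure capture only those values. Whether that fits depends on how much of the config the task really needs.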