
Persisting a data frame in pyspark2 does not work when a storage level is specified. What am I doing wrong?

I am trying to persist two very large data frames before performing a join, in order to work around a "java.util.concurrent.TimeoutException: Futures timed out..." issue (reference: ).
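
For reference, the overall pattern I'm attempting looks roughly like this (df1, df2, and the join column "id" are placeholders for my actual data):

df1.persist()  # cache both sides before the join so they aren't recomputed
df2.persist()  # while the join's futures are pending
result = df1.join(df2, on="id", how="inner")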

Calling persist() by itself works, but when I try to specify a storage level, I get a NameError.

I have tried the following:

df.persist(pyspark.StorageLevel.MEMORY_ONLY) 
NameError: name 'MEMORY_ONLY' is not defined

df.persist(StorageLevel.MEMORY_ONLY) 
NameError: name 'StorageLevel' is not defined

import org.apache.spark.storage.StorageLevel 
ImportError: No module named org.apache.spark.storage.StorageLevel

Any help would be greatly appreciated.

You have to import the appropriate package:

from pyspark import StorageLevel
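
With that import in place, the call from the question should work as written, e.g. (df being your existing DataFrame):

df.persist(StorageLevel.MEMORY_ONLY)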

Alternatively, import the whole pyspark package:

import pyspark
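
With the whole package imported, reference the storage level through the package name instead (df again being an existing DataFrame):

df.persist(pyspark.StorageLevel.MEMORY_ONLY)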

The following worked for me:

from pyspark.storagelevel import StorageLevel

df.persist(StorageLevel.MEMORY_ONLY)
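
As a quick sanity check (assuming Spark 2.1+, where DataFrame.storageLevel is available), you can confirm the level took effect and release the cache when you're done:

print(df.storageLevel)  # should show a memory-only level, e.g. StorageLevel(False, True, False, False, 1)
df.unpersist()          # drop the cached blocks once the join is finished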