无法将本地文件加载到 PySpark Dataframe
Cannot load local file into PySpark Dataframe
我是 MacOS 用户,我刚刚下载了 Apache Spark。然后我把它放在 /usr/local/spark
中。
这是我的 .bash_profile
:
export SPARK_HOME="/usr/local/spark"
export PYSPARK_PYTHON=python3
export PATH=$PATH:$SPARK_HOME/bin
#export PYSPARK_DRIVER_PYTHON="jupyter"
#export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
问题是,输入pyspark时输入pyspark shell,然后输入这两行:
spark = SparkSession.builder.appName("preprocessing").config("spark-master", "local").getOrCreate()
df = spark.read.format("csv").option("header","true").option("inferSchema", "true").option("delimiter",",").load("src/census-income.data")
发生错误:
2018-10-02 19:55:24 ERROR PoolWatchThread:118 - Error in trying to obtain a connection. Retrying in 7000ms
java.sql.SQLException: A read-only user or a user in a read-only database is not permitted to disable read-only mode on a connection.
at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown Source)
at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown Source)
at org.apache.derby.impl.jdbc.EmbedConnection.setReadOnly(Unknown Source)
at com.jolbox.bonecp.ConnectionHandle.setReadOnly(ConnectionHandle.java:1324)
at com.jolbox.bonecp.ConnectionHandle.<init>(ConnectionHandle.java:262)
at com.jolbox.bonecp.PoolWatchThread.fillConnections(PoolWatchThread.java:115)
at com.jolbox.bonecp.PoolWatchThread.run(PoolWatchThread.java:82)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: ERROR 25505: A read-only user or a user in a read-only database is not permitted to disable read-only mode on a connection.
at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
at org.apache.derby.impl.sql.conn.GenericAuthorizer.setReadOnlyConnection(Unknown Source)
at org.apache.derby.impl.sql.conn.GenericLanguageConnectionContext.setReadOnly(Unknown Source)
... 8 more
- Spark 版本:2.3.2
- Python版本:3.7.0
您能尝试从当前目录 (SPARK_HOME) 中删除文件 metastore_db/dbex.lck 吗?
Spark 正在尝试从 HDFS 加载。显然你没有安装 hadoop 并且 spark 无法连接到 HDFS。
如果你想从加载加载你必须明确指定:
file:///src/census-income.data
我是 MacOS 用户,我刚刚下载了 Apache Spark。然后我把它放在 /usr/local/spark
中。
这是我的 .bash_profile
:
export SPARK_HOME="/usr/local/spark"
export PYSPARK_PYTHON=python3
export PATH=$PATH:$SPARK_HOME/bin
#export PYSPARK_DRIVER_PYTHON="jupyter"
#export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
问题是,输入pyspark时输入pyspark shell,然后输入这两行:
spark = SparkSession.builder.appName("preprocessing").config("spark-master", "local").getOrCreate()
df = spark.read.format("csv").option("header","true").option("inferSchema", "true").option("delimiter",",").load("src/census-income.data")
发生错误:
2018-10-02 19:55:24 ERROR PoolWatchThread:118 - Error in trying to obtain a connection. Retrying in 7000ms
java.sql.SQLException: A read-only user or a user in a read-only database is not permitted to disable read-only mode on a connection.
at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown Source)
at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown Source)
at org.apache.derby.impl.jdbc.EmbedConnection.setReadOnly(Unknown Source)
at com.jolbox.bonecp.ConnectionHandle.setReadOnly(ConnectionHandle.java:1324)
at com.jolbox.bonecp.ConnectionHandle.<init>(ConnectionHandle.java:262)
at com.jolbox.bonecp.PoolWatchThread.fillConnections(PoolWatchThread.java:115)
at com.jolbox.bonecp.PoolWatchThread.run(PoolWatchThread.java:82)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: ERROR 25505: A read-only user or a user in a read-only database is not permitted to disable read-only mode on a connection.
at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
at org.apache.derby.impl.sql.conn.GenericAuthorizer.setReadOnlyConnection(Unknown Source)
at org.apache.derby.impl.sql.conn.GenericLanguageConnectionContext.setReadOnly(Unknown Source)
... 8 more
- Spark 版本:2.3.2
- Python版本:3.7.0
您能尝试从当前目录 (SPARK_HOME) 中删除文件 metastore_db/dbex.lck 吗?
Spark 正在尝试从 HDFS 加载。显然你没有安装 hadoop 并且 spark 无法连接到 HDFS。 如果你想从加载加载你必须明确指定:
file:///src/census-income.data