Running pyspark.mllib on Ubuntu
I am trying to link Spark in Python. The code below is test.py, which I put under ~/spark/python:
from pyspark import SparkContext, SparkConf
from pyspark.mllib.fpm import FPGrowth

appName = "FPGrowthTest"  # example values; appName and master must be defined before building the conf
master = "local"
conf = SparkConf().setAppName(appName).setMaster(master)
sc = SparkContext(conf=conf)
data = sc.textFile("data/mllib/sample_fpgrowth.txt")
transactions = data.map(lambda line: line.strip().split(' '))
model = FPGrowth.train(transactions, minSupport=0.2, numPartitions=10)
result = model.freqItemsets().collect()
for fi in result:
    print(fi)
When I run python test.py, I get this error message:
Exception in thread "main" java.lang.IllegalStateException: Library directory '/home/user/spark/lib_managed/jars' does not exist.
at org.apache.spark.launcher.CommandBuilderUtils.checkState(CommandBuilderUtils.java:249)
at org.apache.spark.launcher.AbstractCommandBuilder.buildClassPath(AbstractCommandBuilder.java:208)
at org.apache.spark.launcher.AbstractCommandBuilder.buildJavaCommand(AbstractCommandBuilder.java:119)
at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildSparkSubmitCommand(SparkSubmitCommandBuilder.java:195)
at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildCommand(SparkSubmitCommandBuilder.java:121)
at org.apache.spark.launcher.Main.main(Main.java:86)
Traceback (most recent call last):
File "test.py", line 6, in <module>
conf = SparkConf().setAppName(appName).setMaster(master)
File "/home/user/spark/python/pyspark/conf.py", line 104, in __init__
SparkContext._ensure_initialized()
File "/home/user/spark/python/pyspark/context.py", line 245, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway()
File "/home/user/spark/python/pyspark/java_gateway.py", line 94, in launch_gateway
raise Exception("Java gateway process exited before sending the driver its port number")
Exception: Java gateway process exited before sending the driver its port number
When I move test.py to ~/spark, I get:
Traceback (most recent call last):
File "test.py", line 1, in <module>
from pyspark import SparkContext, SparkConf
ImportError: No module named pyspark
I cloned the Spark project from the official website.
OS: Ubuntu
Java version: 1.7.0_79
Python version: 2.7.11
Can anyone give me some hints on how to solve this?
If you have not set SPARK_HOME yet, check it and add Spark's Python libraries to your PYTHONPATH.
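For example, a minimal sketch in the shell (the paths and the py4j version inside the zip name are assumptions; check $SPARK_HOME/python/lib for the exact file shipped with your release):

# point SPARK_HOME at your Spark installation (assumed location)
export SPARK_HOME=~/spark
# make the pyspark package and the bundled py4j importable from plain python
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH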
Also,
"I clone Spark project from the official website"
is not recommended, because it can cause many dependency problems. You can try to download a pre-built version with Hadoop, then test it in local mode with the instructions here.
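For instance (the Spark release and Hadoop version below are assumptions; pick the combination that matches your setup from the downloads page):

# fetch a pre-built release from the Apache archive
wget https://archive.apache.org/dist/spark/spark-1.6.1/spark-1.6.1-bin-hadoop2.6.tgz
# unpack it and point SPARK_HOME at the extracted directory
tar -xzf spark-1.6.1-bin-hadoop2.6.tgz
export SPARK_HOME="$PWD/spark-1.6.1-bin-hadoop2.6"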
Spark programs must be submitted via "spark-submit". More info: Documentation.
You should try running $SPARK_HOME/bin/spark-submit test.py instead of python test.py.
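For example, to run the script in local mode with all available cores (the --master value here is just one option; use whatever master your cluster expects):

# submitting through spark-submit sets up the driver classpath and the py4j gateway for you
$SPARK_HOME/bin/spark-submit --master "local[*]" test.py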