Combining Spark Streaming + MLlib
I am trying to use a random forest model to predict a stream of examples, but I cannot seem to use the model to classify them.
Here is the code used in pyspark:
sc = SparkContext(appName="App")
model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={}, impurity='gini', numTrees=150)
ssc = StreamingContext(sc, 1)
lines = ssc.socketTextStream(hostname, int(port))
parsedLines = lines.map(parse)
parsedLines.pprint()
predictions = parsedLines.map(lambda event: model.predict(event.features))
When run on the cluster, it returns this error:
Error : "It appears that you are attempting to reference SparkContext from a broadcast "
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
Is there a way to use a model generated from static data to predict streaming examples?
Thanks guys, I really appreciate it!!!!
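Yes, you can use a model generated from static data. The problem you are experiencing is not related to streaming at all. You simply cannot use a JVM-based model inside an action or a transformation: its predict method calls into the JVM through the driver's SparkContext, which is what the exception is complaining about. Instead, you should apply the predict method to a complete RDD, for example by using transform on a DStream: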
from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from operator import attrgetter
sc = SparkContext("local[2]", "foo")
ssc = StreamingContext(sc, 1)
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
trainingData, testData = data.randomSplit([0.7, 0.3])
model = RandomForest.trainClassifier(
    trainingData, numClasses=2, categoricalFeaturesInfo={}, numTrees=3
)
(ssc
    .queueStream([testData])
    # Extract features
    .map(attrgetter("features"))
    # Predict on each batch RDD as a whole; transform's function runs on the driver
    .transform(lambda _, rdd: model.predict(rdd))
    .pprint())
ssc.start()
ssc.awaitTerminationOrTimeout(10)
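The same pattern should carry over to the socket stream from the question. Below is a minimal sketch, assuming each parsed event exposes a `features` attribute as in the question's code; the helper name `predict_stream` is made up for illustration and is not part of Spark.

```python
from operator import attrgetter


def predict_stream(model, events):
    """Return a DStream of predictions for a DStream of parsed events.

    `model.predict` is called on each batch RDD as a whole, inside
    `transform`, whose function runs on the driver. The model is
    therefore never captured in a closure shipped to the workers.
    """
    # Keep only the feature vectors (assumes events have `.features`).
    features = events.map(attrgetter("features"))
    # One predict call per batch RDD instead of one per record.
    return features.transform(lambda rdd: model.predict(rdd))


# Hypothetical usage with the names from the question:
# predictions = predict_stream(model, parsedLines)
# predictions.pprint()
```

The only change from the failing version is where `model.predict` is called: once per batch on the driver, rather than inside a `map` lambda that gets serialized and sent to the workers.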