How to save an IDFModel with PySpark
I generated an IDFModel with PySpark and an IPython notebook, as follows:
from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.feature import IDF

hashingTF = HashingTF()  # this will be used for hashing later
txtdata_train = sc.wholeTextFiles("/home/ubuntu/folder").sortByKey()  # returns an RDD of (filename, contents) pairs, one per file in the directory
split_data_train = txtdata_train.map(parse)  # my parse function puts the RDD in the form I want
tf_train = hashingTF.transform(split_data_train)  # term-frequency sparse vectors for the training set
tf_train.cache()
idf_train = IDF().fit(tf_train)  # fits the IDFModel -- THIS IS WHAT I WANT TO SAVE!
tfidf_train = idf_train.transform(tf_train)
This follows the guide at https://spark.apache.org/docs/1.2.0/mllib-feature-extraction.html. I want to save this model so I can load it again later in a different notebook. However, there is no information on how to do this; the closest thing I found is:
Save Apache Spark mllib model in python
But when I tried the suggestion from that answer,
idf_train.save(sc, "/home/ubuntu/newfolder")
I got the error:
AttributeError: 'IDFModel' object has no attribute 'save'
Am I missing something, or is this simply not possible with IDFModel objects? Thanks!
I've done something similar in Scala/Java. It seems to work, but it may not be very efficient. The idea is to write the object to a file as a serialized object and then read it back later. Good luck! :)
import java.io.{FileNotFoundException, FileOutputStream, IOException, ObjectOutputStream}

try {
  val fileOut: FileOutputStream = new FileOutputStream(savePath + "/idf.jserialized")
  val out: ObjectOutputStream = new ObjectOutputStream(fileOut)
  out.writeObject(idf)  // idf is the fitted model
  out.close()
  fileOut.close()
  System.out.println("\nSerialization successful; check out your specified output file.\n")
} catch {
  case foe: FileNotFoundException => foe.printStackTrace()
  case ioe: IOException => ioe.printStackTrace()
}
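The same serialize-to-file idea can be sketched in plain Python with the standard pickle module. One caveat: a PySpark mllib IDFModel is a thin wrapper around a JVM object, so pickling the model object itself will generally fail; this sketch assumes you first extract plain Python data from the model (the `idf_weights` list below is a hypothetical stand-in for such extracted data, e.g. the model's IDF weights) and persist that instead.

```python
import pickle

# Hypothetical stand-in for data extracted from the model,
# e.g. the IDF weights as a plain Python list of floats.
idf_weights = [0.0, 1.2, 0.7, 2.3]

save_path = "/tmp/idf_weights.pkl"

# Serialize the object to a file...
with open(save_path, "wb") as f:
    pickle.dump(idf_weights, f)

# ...and read it back later, e.g. from a different notebook.
with open(save_path, "rb") as f:
    restored = pickle.load(f)

print(restored == idf_weights)  # True if the round trip preserved the data
```

With the restored weights you would still need to rebuild whatever model-side behavior you rely on, so this is a workaround rather than a true model save/load.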