LDA in Spark 1.3.1. Converting raw data into Term Document Matrix?
I'm trying out Spark 1.3.1's LDA in Java and I'm getting this error:
Error: application failed with exception
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.NumberFormatException: For input string: "��"
My .txt file looks like this:

now lifting weights find pull ups push ups difficult
blindness disease eyes everything functions fine except ability take in light use light form images
role model child
dear memories saddest memory childhood
Here's the code:
import scala.Tuple2;

import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.clustering.LDAModel;
import org.apache.spark.mllib.clustering.LDA;
import org.apache.spark.mllib.linalg.Matrix;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.SparkConf;

public class JavaLDA {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("LDA Example");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Load and parse the data
        String path = "/tutorial/input/askreddit20150801.txt";
        JavaRDD<String> data = sc.textFile(path);
        JavaRDD<Vector> parsedData = data.map(
            new Function<String, Vector>() {
                public Vector call(String s) {
                    String[] sarray = s.trim().split(" ");
                    double[] values = new double[sarray.length];
                    for (int i = 0; i < sarray.length; i++)
                        values[i] = Double.parseDouble(sarray[i]);
                    return Vectors.dense(values);
                }
            }
        );

        // Index documents with unique IDs
        JavaPairRDD<Long, Vector> corpus = JavaPairRDD.fromJavaRDD(parsedData.zipWithIndex().map(
            new Function<Tuple2<Vector, Long>, Tuple2<Long, Vector>>() {
                public Tuple2<Long, Vector> call(Tuple2<Vector, Long> doc_id) {
                    return doc_id.swap();
                }
            }
        ));
        corpus.cache();

        // Cluster the documents into 100 topics using LDA
        LDAModel ldaModel = new LDA().setK(100).run(corpus);

        // Output topics. Each is a distribution over words (matching word count vectors)
        System.out.println("Learned topics (as distributions over vocab of " + ldaModel.vocabSize()
            + " words):");
        Matrix topics = ldaModel.topicsMatrix();
        for (int topic = 0; topic < 100; topic++) {
            System.out.print("Topic " + topic + ":");
            for (int word = 0; word < ldaModel.vocabSize(); word++) {
                System.out.print(" " + topics.apply(word, topic));
            }
            System.out.println();
        }

        ldaModel.save(sc.sc(), "myLDAModel");
    }
}
Anyone know why this is happening? I'm just trying out Spark LDA for the first time. Thanks.
values[i] = Double.parseDouble(sarray[i]);
Why are you trying to turn every word of a text file into a Double?
That's the answer to your question:
http://docs.oracle.com/javase/6/docs/api/java/lang/Double.html#parseDouble%28java.lang.String%29
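A tiny stand-alone demo (plain Java, no Spark; "weights" is just an arbitrary word like the ones in your file) of what parseDouble does with non-numeric input:

```java
public class ParseDoubleDemo {
    public static void main(String[] args) {
        // Numeric strings parse fine
        System.out.println(Double.parseDouble("3.14")); // prints 3.14

        // Any non-numeric token throws NumberFormatException,
        // which is exactly what the failed Spark task reports
        try {
            Double.parseDouble("weights");
        } catch (NumberFormatException e) {
            System.out.println("NumberFormatException: " + e.getMessage());
        }
    }
}
```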
Your code is expecting the input file to be a bunch of lines of space-separated text that look like numbers. Assuming your text is words instead:
Get a list of every word that appears in the corpus (this also needs `java.util.Arrays` and `org.apache.spark.api.java.function.FlatMapFunction` imported):

JavaRDD<String> words =
    data.flatMap((FlatMapFunction<String, String>) s -> {
        s = s.replaceAll("[^a-zA-Z ]", "");
        s = s.toLowerCase();
        return Arrays.asList(s.split(" "));
    });
Make a map that associates each distinct word with an integer index (the distinct() keeps the indices in the range 0 to vocab.size() - 1; you'll also need `java.util.Map` imported):

Map<String, Long> vocab = words.distinct().zipWithIndex().collectAsMap();
Then, instead of having parsedData do what it's currently doing, have it look up each word, find the associated number, go to that spot in the array, and add 1 to that word's count:
JavaRDD<Vector> tokens = data.map(
    (Function<String, Vector>) s -> {
        // Apply the same cleanup used when building the vocabulary,
        // otherwise vocab.get(val) returns null for raw tokens
        s = s.replaceAll("[^a-zA-Z ]", "").toLowerCase();
        String[] vals = s.split("\\s+");
        double[] idx = new double[vocab.size()];
        for (String val : vals) {
            Long wordIndex = vocab.get(val);
            if (wordIndex != null && wordIndex < idx.length) {
                idx[wordIndex.intValue()] += 1.0;
            }
        }
        return Vectors.dense(idx);
    }
);
This produces an RDD of Vectors, where each vector is vocab.size() long, and each spot in the vector is the count of how many times that word appears in that line.
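If it helps to see that counting logic outside Spark, here's a minimal plain-Java sketch over a made-up two-line corpus (the vocab/index scheme mirrors the snippets above; LinkedHashMap is used only so the printed order is predictable):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

public class CountVectorDemo {
    public static void main(String[] args) {
        String[] lines = { "lifting weights lifting", "weights hard" };

        // Assign each distinct word the next free integer index
        Map<String, Long> vocab = new LinkedHashMap<>();
        for (String line : lines)
            for (String w : line.split("\\s+"))
                vocab.putIfAbsent(w, (long) vocab.size());
        // vocab is now {lifting=0, weights=1, hard=2}

        // One count vector per line, vocab.size() long
        for (String line : lines) {
            double[] counts = new double[vocab.size()];
            for (String w : line.split("\\s+"))
                counts[vocab.get(w).intValue()] += 1.0;
            System.out.println(line + " -> " + Arrays.toString(counts));
            // "lifting weights lifting" -> [2.0, 1.0, 0.0]
            // "weights hard"            -> [0.0, 1.0, 1.0]
        }
    }
}
```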
I modified this code slightly from what I'm currently using, and I haven't tested it, so there may be errors in it. Good luck!