Distribution of words per topic p(w|t) in Mallet
I need to get the distribution of words per topic that Mallet finds, in Java (not in the CLI, as asked in "how to get a probability distribution for a topic in mallet?"). For an example of what I mean, see Introduction to Latent Dirichlet Allocation:
Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, … (at which point, you could interpret topic A to be about food)
Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, … (at which point, you could interpret topic B to be about cute animals)
Mallet provides token "weights" for each topic, and at http://comments.gmane.org/gmane.comp.ai.mallet.devel/2064 someone attempted to write a method that gets the distribution of words per topic for Mallet.
I modified that method so that all weights are divided by their sum, as discussed in the mailing-list thread above.
Does the following method (when added to ParallelTopicModel.java) correctly compute the distribution of words per topic p(w|t) in Mallet?
/**
 * Get the normalized topic word weights (weights sum up to 1.0)
 * @param topic the topic
 * @return the normalized topic word weights (weights sum up to 1.0)
 */
public ArrayList<double[]> getNormalizedTopicWordWeights(int topic) {
    ArrayList<double[]> tokenWeights = new ArrayList<double[]>();
    for (int type = 0; type < numTypes; type++) {
        // typeTopicCounts stores packed entries: the low topicBits bits hold
        // the topic index and the high bits hold the count; a zero entry
        // terminates the list.
        int[] topicCounts = typeTopicCounts[type];
        double weight = beta;
        int index = 0;
        while (index < topicCounts.length && topicCounts[index] > 0) {
            int currentTopic = topicCounts[index] & topicMask;
            if (currentTopic == topic) {
                weight += topicCounts[index] >> topicBits;
                break;
            }
            index++;
        }
        double[] tokenAndWeight = { (double) type, weight };
        tokenWeights.add(tokenAndWeight);
    }
    // normalize: first get the sum of all weights
    double sum = 0;
    for (double[] tokenAndWeight : tokenWeights) {
        sum += tokenAndWeight[1];
    }
    // then divide each element by the sum
    ArrayList<double[]> normalizedTokenWeights = new ArrayList<double[]>();
    for (double[] tokenAndWeight : tokenWeights) {
        tokenAndWeight[1] = tokenAndWeight[1] / sum;
        normalizedTokenWeights.add(tokenAndWeight);
    }
    return normalizedTokenWeights;
}
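For context, here is a minimal sketch of how the method might be called from outside the class to print a topic in the broccoli/bananas style above. It assumes the method has been added to ParallelTopicModel as shown and that the model's public alphabet field can be used to map type indices back to word strings; the class and method names below are just for illustration.

import cc.mallet.topics.ParallelTopicModel;
import java.util.ArrayList;
import java.util.Comparator;

public class TopicWordsExample {
    // Print the top-n words of a topic with their probabilities.
    public static void printTopWords(ParallelTopicModel model, int topic, int n) {
        ArrayList<double[]> weights = model.getNormalizedTopicWordWeights(topic);
        // Sort the (type, weight) pairs by weight, highest first.
        weights.sort(Comparator.comparingDouble((double[] pair) -> pair[1]).reversed());
        for (int i = 0; i < n && i < weights.size(); i++) {
            int type = (int) weights.get(i)[0];
            double p = weights.get(i)[1];
            // model.alphabet maps a type index back to its word string.
            System.out.printf("%s\t%.4f%n", model.alphabet.lookupObject(type), p);
        }
    }
}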
This looks like it would work, but I have a few comments on style.
I'm not keen on using double arrays to represent type/weight pairs. If you're iterating over all types anyway, why not use a dense double[] array indexed by type? An ArrayList might make sense if you need to sort the entries in some other method outside this one, but the unnormalized intermediate ArrayList seems wasteful.
The second summing loop seems unnecessary. You could initialize sum to numTypes * beta and then add weight - beta only when you hit a type with a non-zero count.
Defining normalizer = 1.0/sum and then multiplying rather than dividing in the normalization loop often makes a noticeable difference.
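Putting those suggestions together, a possible rewrite might look like the following sketch (untested, keeping the same typeTopicCounts traversal as the original): it returns a dense double[] indexed by type, accumulates the sum in a single pass starting from numTypes * beta, and multiplies by a precomputed normalizer instead of dividing.

/**
 * Sketch of the method rewritten per the comments above: a dense array
 * indexed by type, a single summing pass, and multiplication by 1.0/sum.
 */
public double[] getNormalizedTopicWordWeights(int topic) {
    double[] weights = new double[numTypes];
    // Every type starts with the beta smoothing mass, so the total starts
    // at numTypes * beta and only observed counts need to be added.
    double sum = numTypes * beta;
    for (int type = 0; type < numTypes; type++) {
        int[] topicCounts = typeTopicCounts[type];
        double weight = beta;
        int index = 0;
        while (index < topicCounts.length && topicCounts[index] > 0) {
            if ((topicCounts[index] & topicMask) == topic) {
                int count = topicCounts[index] >> topicBits;
                weight += count;
                sum += count;  // equivalent to adding (weight - beta)
                break;
            }
            index++;
        }
        weights[type] = weight;
    }
    // Multiply by the reciprocal instead of dividing inside the loop.
    double normalizer = 1.0 / sum;
    for (int type = 0; type < numTypes; type++) {
        weights[type] *= normalizer;
    }
    return weights;
}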