Mahout clustering - all text vectors in single cluster - why?
I am running the example below:
Document 1 -> John saw a red car.
Document 2 -> Marta found a red bike.
Document 3 -> Don need a blue coat.
Document 4 -> Mike bought a blue boat.
Document 5 -> Albert wants a blue dish.
Document 6 -> Lara likes blue glasses.
Document 7 -> Donna, do you have red apples?
Document 8 -> Sonia needs blue books.
Document 9 -> I like blue eyes.
Document 10 -> Arleen has a red carpet.
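It works as expected with EuclideanDistanceMeasure. But I am not sure why the distance measures intended for text (TanimotoDistanceMeasure and CosineDistanceMeasure) give me only one cluster.
Why is this? I don't claim to understand the two distance measures that give unsatisfactory results, but what might I need to change? There are too many numbers in there, and I can't tell what effect each one has. I have the book "Mahout in Action", though I have only read 2 chapters.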
EuclideanDistanceMeasure (2 clusters - good)
Clusters:
7 -> wt: 1.0 distance: 4.4960791719810365 vec: Document 1 = [8:2.609, 21:2.609, 29:1.693, 30:2.609]
7 -> wt: 1.0 distance: 4.496079376645949 vec: Document 10 = [2:2.609, 9:2.609, 18:2.609, 29:1.693]
7 -> wt: 1.0 distance: 4.496079576525459 vec: Document 2 = [3:2.609, 16:2.609, 25:2.609, 29:1.693]
9 -> wt: 1.0 distance: 4.389955960700927 vec: Document 3 = [4:1.357, 10:2.609, 13:2.609, 27:2.609]
9 -> wt: 1.0 distance: 4.389956011306051 vec: Document 4 = [4:1.357, 5:2.609, 7:2.609, 26:2.609]
9 -> wt: 1.0 distance: 4.3899560687101395 vec: Document 5 = [0:2.609, 4:1.357, 11:2.609, 32:2.609]
9 -> wt: 1.0 distance: 4.389956137136399 vec: Document 6 = [4:1.357, 17:2.609, 22:2.609, 24:2.609]
7 -> wt: 1.0 distance: 5.577549042707083 vec: Document 7 = [1:2.609, 12:2.609, 14:2.609, 19:2.609, 29:1.693, 33:2.609]
9 -> wt: 1.0 distance: 4.389956708176695 vec: Document 8 = [4:1.357, 6:2.609, 28:2.609, 31:2.609]
9 -> wt: 1.0 distance: 4.389471924190491 vec: Document 9 = [4:1.357, 15:2.609, 20:2.609, 23:2.609]
Produced by:
CanopyDriver.run(new Path(vectorsFolder), new Path(canopyCentroids), new EuclideanDistanceMeasure(), 20, 5,
true, 0, true);
FuzzyKMeansDriver.run(new Path(vectorsFolder), new Path(canopyCentroids, "clusters-0-final"),
new Path(clusterOutput), 0.01, 20, 2, true, true, 0, false);
CosineDistanceMeasure (only 1 cluster - not good)
Clusters:
0 -> wt: 1.0 distance: 0.6362357041216559 vec: Document 1 = [8:2.609, 21:2.609, 29:1.693, 30:2.609]
0 -> wt: 1.0 distance: 0.6362357041216559 vec: Document 10 = [2:2.609, 9:2.609, 18:2.609, 29:1.693]
0 -> wt: 1.0 distance: 0.636235704121656 vec: Document 2 = [3:2.609, 16:2.609, 25:2.609, 29:1.693]
0 -> wt: 1.0 distance: 0.6328896123664868 vec: Document 3 = [4:1.357, 10:2.609, 13:2.609, 27:2.609]
0 -> wt: 1.0 distance: 0.6328896123664868 vec: Document 4 = [4:1.357, 5:2.609, 7:2.609, 26:2.609]
0 -> wt: 1.0 distance: 0.6328896123664868 vec: Document 5 = [0:2.609, 4:1.357, 11:2.609, 32:2.609]
0 -> wt: 1.0 distance: 0.6328896123664868 vec: Document 6 = [4:1.357, 17:2.609, 22:2.609, 24:2.609]
0 -> wt: 1.0 distance: 0.5876411474816594 vec: Document 7 = [1:2.609, 12:2.609, 14:2.609, 19:2.609, 29:1.693, 33:2.609]
0 -> wt: 1.0 distance: 0.6328896123664868 vec: Document 8 = [4:1.357, 6:2.609, 28:2.609, 31:2.609]
0 -> wt: 1.0 distance: 0.6328896123664868 vec: Document 9 = [4:1.357, 15:2.609, 20:2.609, 23:2.609]
Produced by:
CanopyDriver.run(new Path(vectorsFolder), new Path(canopyCentroids), new CosineDistanceMeasure(), 20, 5,
true, 0, true);
FuzzyKMeansDriver.run(new Path(vectorsFolder), new Path(canopyCentroids, "clusters-0-final"),
new Path(clusterOutput), 0.01, 20, 2, true, true, 0, false);
TanimotoDistanceMeasure (only 1 cluster - not good)
Clusters:
0 -> wt: 1.0 distance: 0.8637279689324617 vec: Document 1 = [8:2.609, 21:2.609, 29:1.693, 30:2.609]
0 -> wt: 1.0 distance: 0.8637279689324617 vec: Document 10 = [2:2.609, 9:2.609, 18:2.609, 29:1.693]
0 -> wt: 1.0 distance: 0.8637279689324617 vec: Document 2 = [3:2.609, 16:2.609, 25:2.609, 29:1.693]
0 -> wt: 1.0 distance: 0.8596377086023765 vec: Document 3 = [4:1.357, 10:2.609, 13:2.609, 27:2.609]
0 -> wt: 1.0 distance: 0.8596377086023765 vec: Document 4 = [4:1.357, 5:2.609, 7:2.609, 26:2.609]
0 -> wt: 1.0 distance: 0.8596377086023765 vec: Document 5 = [0:2.609, 4:1.357, 11:2.609, 32:2.609]
0 -> wt: 1.0 distance: 0.8596377086023765 vec: Document 6 = [4:1.357, 17:2.609, 22:2.609, 24:2.609]
0 -> wt: 1.0 distance: 0.8723755210900389 vec: Document 7 = [1:2.609, 12:2.609, 14:2.609, 19:2.609, 29:1.693, 33:2.609]
0 -> wt: 1.0 distance: 0.8596377086023765 vec: Document 8 = [4:1.357, 6:2.609, 28:2.609, 31:2.609]
0 -> wt: 1.0 distance: 0.8596377086023765 vec: Document 9 = [4:1.357, 15:2.609, 20:2.609, 23:2.609]
Produced by:
CanopyDriver.run(new Path(vectorsFolder), new Path(canopyCentroids), new TanimotoDistanceMeasure(), 20, 5,
true, 0, true);
FuzzyKMeansDriver.run(new Path(vectorsFolder), new Path(canopyCentroids, "clusters-0-final"),
new Path(clusterOutput), 0.01, 20, 2, true, true, 0, false);
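One thing worth checking here is the scale of each measure against the thresholds 20 and 5 passed to CanopyDriver.run(). The sketch below is plain Java, not Mahout code: it applies the textbook cosine, Tanimoto and Euclidean definitions (assumed, not verified, to match Mahout's classes) to three of the TF-IDF vectors printed above. The class name DistanceBounds and the helper vec are made up for this illustration.

```java
public class DistanceBounds {
    /** Builds a dense vector of the given size from (index, weight) pairs. */
    public static double[] vec(int size, double... indexWeightPairs) {
        double[] v = new double[size];
        for (int i = 0; i < indexWeightPairs.length; i += 2) {
            v[(int) indexWeightPairs[i]] = indexWeightPairs[i + 1];
        }
        return v;
    }

    public static double dot(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /** Straight-line distance; unbounded, grows with the term weights. */
    public static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** 1 - cos(angle); stays within [0, 1] for non-negative weights. */
    public static double cosine(double[] a, double[] b) {
        return 1.0 - dot(a, b) / Math.sqrt(dot(a, a) * dot(b, b));
    }

    /** 1 - a.b / (|a|^2 + |b|^2 - a.b); also within [0, 1]. */
    public static double tanimoto(double[] a, double[] b) {
        double ab = dot(a, b);
        return 1.0 - ab / (dot(a, a) + dot(b, b) - ab);
    }

    public static void main(String[] args) {
        // Weights copied from the cluster dump above (34 term indices, 0..33).
        double[] doc1 = vec(34, 8, 2.609, 21, 2.609, 29, 1.693, 30, 2.609);
        double[] doc10 = vec(34, 2, 2.609, 9, 2.609, 18, 2.609, 29, 1.693);
        double[] doc3 = vec(34, 4, 1.357, 10, 2.609, 13, 2.609, 27, 2.609);

        System.out.println("euclidean doc1-doc10: " + euclidean(doc1, doc10));
        System.out.println("euclidean doc1-doc3:  " + euclidean(doc1, doc3));
        System.out.println("cosine    doc1-doc10: " + cosine(doc1, doc10));
        System.out.println("cosine    doc1-doc3:  " + cosine(doc1, doc3));
        System.out.println("tanimoto  doc1-doc10: " + tanimoto(doc1, doc10));
        System.out.println("tanimoto  doc1-doc3:  " + tanimoto(doc1, doc3));
    }
}
```

With non-negative weights the cosine and Tanimoto distances cannot exceed 1, so every pair of documents is closer than a threshold of 5, let alone 20. The Euclidean distances for these vectors fall between 5 and 20, which would explain why the same thresholds behave differently there.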
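As Anony-Mousse said in his first reply, the data I was feeding it belongs in one cluster. After some soul-searching over the last few weeks (or more specifically, experimenting directly with the distance measure classes), I found a data set that produces more than one cluster:
1) Make sure the data is sufficiently distinct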
Text id1 = new Text("Document 1");
Text text1 = new Text("Atletico Madrid win");
writer.append(id1, text1);
Text id6 = new Text("Document 6");
Text text6 = new Text("Both apple and orange are fruit");
writer.append(id6, text6);
Text id7 = new Text("Document 7");
Text text7 = new Text("Both orange and apple are fruit");
writer.append(id7, text7);
2) Determine good radius values
a) Use your sample data
Experiment with the DistanceMeasure class:
Vector v1 = toVector("Atletico Madrid win");
Vector v2 = toVector("Both apple and orange are fruit");
Vector v3 = toVector("Both orange and apple are fruit");
// Copy into a mutable list (ImmutableList rejects removals)
List<Vector> vectorList = new LinkedList<Vector>(ImmutableList.of(v1, v2, v3));
List<Canopy> canopies = CanopyClusterer.createCanopies(vectorList, new CosineDistanceMeasure(), 0.3, 0.3);
for (Canopy canopy : canopies) {
System.out.println("DistanceMeasureMain.main() " + canopy.asFormatString());
}
Produces:
DistanceMeasureMain.main() distance is 0.19193857965451055
DistanceMeasureMain.main() distance is 0.5281191379648771
DistanceMeasureMain.main() distance is 0.19193857965451055
DistanceMeasureMain.main() C0: {0:1.1,117724:1.0,378550445:1.0,1997849123:1.0}
DistanceMeasureMain.main() C1: {0:1.1,96727:1.0,96852:1.0,2076577:1.0,93029210:1.0,97711124:1.0,1008851410:1.0}
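To see how t1 and t2 decide the number of canopies, here is a minimal plain-Java sketch of a canopy pass as I understand it: take the first remaining point as a center, add every point closer than t1 to that canopy, and drop every point closer than t2 from further consideration. It is an illustration, not Mahout's implementation. The class name CanopySketch is made up, and the vectors are simple word counts over a nine-word vocabulary rather than the output of toVector, so the distances will not match the 0.19 and 0.53 printed above.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

public class CanopySketch {
    /** Cosine distance, clamped at 0 to absorb rounding error. */
    public static double cosineDistance(double[] a, double[] b) {
        double ab = 0, aa = 0, bb = 0;
        for (int i = 0; i < a.length; i++) {
            ab += a[i] * b[i];
            aa += a[i] * a[i];
            bb += b[i] * b[i];
        }
        return Math.max(0.0, 1.0 - ab / Math.sqrt(aa * bb));
    }

    /** Returns one list of point indices per canopy (center first). */
    public static List<List<Integer>> canopies(double[][] points, double t1, double t2) {
        List<Integer> remaining = new LinkedList<>();
        for (int i = 0; i < points.length; i++) {
            remaining.add(i);
        }
        List<List<Integer>> result = new ArrayList<>();
        while (!remaining.isEmpty()) {
            int center = remaining.remove(0); // first remaining point becomes a center
            List<Integer> members = new ArrayList<>();
            members.add(center);
            Iterator<Integer> it = remaining.iterator();
            while (it.hasNext()) {
                int candidate = it.next();
                double d = cosineDistance(points[center], points[candidate]);
                if (d < t1) {
                    members.add(candidate); // loosely bound to this canopy
                }
                if (d < t2) {
                    it.remove(); // tightly bound: can no longer start its own canopy
                }
            }
            result.add(members);
        }
        return result;
    }

    public static void main(String[] args) {
        // Vocabulary: atletico madrid win both apple and orange are fruit
        double[][] points = {
            {1, 1, 1, 0, 0, 0, 0, 0, 0}, // "Atletico Madrid win"
            {0, 0, 0, 1, 1, 1, 1, 1, 1}, // "Both apple and orange are fruit"
            {0, 0, 0, 1, 1, 1, 1, 1, 1}, // "Both orange and apple are fruit"
        };
        System.out.println("t1=0.3, t2=0.3: " + canopies(points, 0.3, 0.3));
        System.out.println("t1=20,  t2=5:   " + canopies(points, 20, 5));
    }
}
```

With thresholds of 0.3 the football sentence and the two fruit sentences end up in separate canopies. With 20 and 5 the first center swallows everything, because no cosine distance can reach 5.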
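b) Use the distances as radius values
I think the t1 and t2 values (0.2 and 0.2) passed to CanopyDriver.run() are also important, although I don't know the effect of every numeric parameter in the calls below: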
// CosineDistanceMeasure
CanopyDriver.run(new Path(vectorsFolder),
new Path(canopyCentroids), new CosineDistanceMeasure(),
0.2, 0.2, true, 1, true);
FuzzyKMeansDriver.run(new Path(vectorsFolder), new Path(
canopyCentroids, "clusters-0-final"), new Path(
clusterOutput), 0.01, 20, 2, true, true, 0, false);
Output:
Document 1 -> Atletico Madrid win
Document 6 -> Both apple and orange are fruit
Document 7 -> Both orange and apple are fruit
Clusters:
0 -> wt: 1.0 distance: 0.0 vec: Document 1 = [1:1.405, 4:1.405, 6:1.405]
1 -> wt: 1.0 distance: 0.0 vec: Document 6 = [0:1.000, 2:1.000, 3:1.000, 5:1.000]
1 -> wt: 1.0 distance: 0.0 vec: Document 7 = [0:1.000, 2:1.000, 3:1.000, 5:1.000]
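On the remaining numbers in the FuzzyKMeansDriver.run() call: my reading, which is an assumption to check against the Javadoc of your Mahout version, is that 0.01 is the convergence delta, 20 the maximum number of iterations, and 2 the fuzziness factor m. The sketch below shows what m does in the standard fuzzy c-means membership formula. It is plain Java written for illustration, not Mahout's code, and the names FuzzyMembership and MIN_DISTANCE are hypothetical.

```java
public class FuzzyMembership {
    // Floor for distances so a point sitting exactly on a centroid
    // does not cause a division by zero (value chosen for illustration).
    private static final double MIN_DISTANCE = 1e-10;

    /**
     * Standard fuzzy c-means membership of one point in each cluster:
     * u_i = 1 / sum_k (d_i / d_k)^(2 / (m - 1)), where d_i is the
     * point's distance to cluster i and m > 1 is the fuzziness.
     */
    public static double[] memberships(double[] distances, double m) {
        double exponent = 2.0 / (m - 1.0);
        double[] u = new double[distances.length];
        for (int i = 0; i < distances.length; i++) {
            double di = Math.max(distances[i], MIN_DISTANCE);
            double sum = 0;
            for (int k = 0; k < distances.length; k++) {
                double dk = Math.max(distances[k], MIN_DISTANCE);
                sum += Math.pow(di / dk, exponent);
            }
            u[i] = 1.0 / sum;
        }
        return u;
    }

    public static void main(String[] args) {
        // A point at distance 1.0 from cluster A and 3.0 from cluster B.
        double[] distances = {1.0, 3.0};
        for (double m : new double[] {1.5, 2.0, 4.0}) {
            double[] u = memberships(distances, m);
            System.out.println("m=" + m + " -> A=" + u[0] + ", B=" + u[1]);
        }
    }
}
```

The memberships always sum to 1. Values of m close to 1 push each point almost entirely into its nearest cluster, while larger values spread the membership more evenly.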