I'm getting different results every time I run my code
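I am using ELKI to cluster my data. I use KMeansLloyd<NumberVector> with k=3.
Every time I run my Java code I get completely different clustering results. Is this normal, or should I do something to make my output nearly stable? Here is my code, which I took from the ELKI tutorial: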
DatabaseConnection dbc = new ArrayAdapterDatabaseConnection(a);
// Create a database (which may contain multiple relations!)
Database db = new StaticArrayDatabase(dbc, null);
// Load the data into the database (do NOT forget to initialize...)
db.initialize();
// Relation containing the number vectors:
Relation<NumberVector> rel = db.getRelation(TypeUtil.NUMBER_VECTOR_FIELD);
// We know that the ids must be a continuous range:
DBIDRange ids = (DBIDRange) rel.getDBIDs();

// K-means should be used with squared Euclidean (least squares):
//SquaredEuclideanDistanceFunction dist = SquaredEuclideanDistanceFunction.STATIC;
CosineDistanceFunction dist = CosineDistanceFunction.STATIC;
// Default initialization, using global random:
// To fix the random seed, use: new RandomFactory(seed);
RandomlyGeneratedInitialMeans init = new RandomlyGeneratedInitialMeans(RandomFactory.DEFAULT);

// Textbook k-means clustering:
KMeansLloyd<NumberVector> km = new KMeansLloyd<>(dist, //
    3 /* k - number of partitions */, //
    0 /* maximum number of iterations: no limit */, init);

// K-means will automatically choose a numerical relation from the data set:
// But we could make it explicit (if there were more than one numeric
// relation!): km.run(db, rel);
Clustering<KMeansModel> c = km.run(db);

// Output all clusters:
int i = 0;
for(Cluster<KMeansModel> clu : c.getAllClusters()) {
  // K-means will name all clusters "Cluster" in lack of noise support:
  System.out.println("#" + i + ": " + clu.getNameAutomatic());
  System.out.println("Size: " + clu.size());
  System.out.println("Center: " + clu.getModel().getPrototype().toString());
  // Iterate over objects:
  System.out.print("Objects: ");
  for(DBIDIter it = clu.getIDs().iter(); it.valid(); it.advance()) {
    // To get the vector use:
    NumberVector v = rel.get(it);
    // Offset within our DBID range: "line number"
    final int offset = ids.getOffset(it);
    System.out.print(v + " " + offset);
    // Do NOT rely on using "internalGetIndex()" directly!
  }
  System.out.println();
  ++i;
}
I would say so, since you are using RandomlyGeneratedInitialMeans:
Initialize k-means by generating random vectors (within the data sets value range).
RandomlyGeneratedInitialMeans init = new RandomlyGeneratedInitialMeans(RandomFactory.DEFAULT);
Yes, it is normal.
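To illustrate what "random vectors within the data set's value range" means, here is a minimal sketch in plain Java. It is not ELKI's implementation; the class name RandomInitialMeansSketch and the method randomMeans are hypothetical, and the sample data is made up. It only shows why an unseeded generator gives different starting means on each run.

```java
import java.util.Arrays;
import java.util.Random;

class RandomInitialMeansSketch {
  /** Draws k vectors uniformly from the per-dimension [min, max] range of the data. */
  static double[][] randomMeans(double[][] data, int k, Random rnd) {
    int dim = data[0].length;
    double[] min = new double[dim];
    double[] max = new double[dim];
    Arrays.fill(min, Double.POSITIVE_INFINITY);
    Arrays.fill(max, Double.NEGATIVE_INFINITY);
    // Find the value range of every dimension.
    for (double[] row : data) {
      for (int d = 0; d < dim; d++) {
        min[d] = Math.min(min[d], row[d]);
        max[d] = Math.max(max[d], row[d]);
      }
    }
    // Draw each coordinate of each initial mean uniformly inside that range.
    double[][] means = new double[k][dim];
    for (int j = 0; j < k; j++) {
      for (int d = 0; d < dim; d++) {
        means[j][d] = min[d] + rnd.nextDouble() * (max[d] - min[d]);
      }
    }
    return means;
  }

  public static void main(String[] args) {
    double[][] data = {{0, 0}, {1, 2}, {4, 1}, {5, 5}}; // made-up sample data
    // new Random() is seeded differently on each start, so the output varies between runs.
    System.out.println(Arrays.deepToString(randomMeans(data, 3, new Random())));
  }
}
```

Since k-means only refines the means it starts from, different starting means can lead to different final clusters.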
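This is too long for a comment. As @Idos stated, you are initializing the cluster means randomly; that is why you are getting random results. The question now is, how do you make sure your results are robust? Try this:
Run the algorithm N times. Each time, record the cluster membership of each observation. When you are done, assign each observation to the cluster that contained it most often. For example, suppose you have 3 observations, 3 classes, and you run the algorithm 3 times:
obs  R1  R2  R3
1    A   A   B
2    B   B   B
3    C   B   B
Then you should classify obs1 as A, because it was most often classified as A. Classify obs2 as B, because it was always classified as B. And classify obs3 as B, because the algorithm most often classified it as B. The more times you run the algorithm, the more stable the results should become.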
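The voting step above can be sketched in plain Java as follows. This is only an illustration: the class name MajorityVoteSketch and the method consensus are hypothetical, and it assumes the cluster labels have already been matched across runs. K-means labels are arbitrary, so cluster "A" in one run need not be cluster "A" in another, and that matching step is not shown here. Ties go to the alphabetically smallest label, which is also an assumption of this sketch.

```java
import java.util.Map;
import java.util.TreeMap;

class MajorityVoteSketch {
  /** runs[r][i] is the label of observation i in run r; returns the most frequent label per observation. */
  static char[] consensus(char[][] runs) {
    int n = runs[0].length;
    char[] result = new char[n];
    for (int obs = 0; obs < n; obs++) {
      // Count how often each label was assigned to this observation.
      Map<Character, Integer> counts = new TreeMap<>();
      for (char[] run : runs) {
        counts.merge(run[obs], 1, Integer::sum);
      }
      // Pick the most frequent label; TreeMap order makes ties go to the smallest label.
      char best = 0;
      int bestCount = -1;
      for (Map.Entry<Character, Integer> e : counts.entrySet()) {
        if (e.getValue() > bestCount) {
          best = e.getKey();
          bestCount = e.getValue();
        }
      }
      result[obs] = best;
    }
    return result;
  }

  public static void main(String[] args) {
    // The three runs R1, R2, R3 from the table, one string per run (observations 1..3).
    char[][] runs = {"ABC".toCharArray(), "ABB".toCharArray(), "BBB".toCharArray()};
    System.out.println(new String(consensus(runs))); // prints ABB
  }
}
```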
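K-means is supposed to be initialized randomly. Getting different results across multiple runs is expected, and even desirable.
If you do not want this, use a fixed random seed.
From the code you copied and pasted:
// To fix the random seed, use: new RandomFactory(seed);
This is exactly what you should do...
long seed = 0;
RandomlyGeneratedInitialMeans init = new RandomlyGeneratedInitialMeans(
    new RandomFactory(seed));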
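The effect of a fixed seed can be checked without ELKI. The sketch below is a toy one-dimensional Lloyd-style k-means in plain Java, not ELKI's KMeansLloyd; the class name SeededKMeansSketch, the method cluster, the iteration cap of 100, and the sample data are all assumptions made for illustration. With the same seed, the initial means and therefore the final assignments are identical from run to run.

```java
import java.util.Arrays;
import java.util.Random;

class SeededKMeansSketch {
  /** Toy 1-D k-means; initial means are drawn uniformly from [min, max] using the given seed. */
  static int[] cluster(double[] data, int k, long seed) {
    Random rnd = new Random(seed); // fixed seed -> same sequence of random numbers
    double min = Arrays.stream(data).min().getAsDouble();
    double max = Arrays.stream(data).max().getAsDouble();
    double[] means = new double[k];
    for (int j = 0; j < k; j++) {
      means[j] = min + rnd.nextDouble() * (max - min);
    }
    int[] assign = new int[data.length];
    Arrays.fill(assign, -1);
    boolean changed = true;
    for (int iter = 0; iter < 100 && changed; iter++) {
      changed = false;
      // Assignment step: attach each point to its nearest mean.
      for (int i = 0; i < data.length; i++) {
        int best = 0;
        for (int j = 1; j < k; j++) {
          if (Math.abs(data[i] - means[j]) < Math.abs(data[i] - means[best])) {
            best = j;
          }
        }
        if (assign[i] != best) {
          assign[i] = best;
          changed = true;
        }
      }
      // Update step: move each mean to the average of its points (empty clusters keep their mean).
      for (int j = 0; j < k; j++) {
        double sum = 0;
        int count = 0;
        for (int i = 0; i < data.length; i++) {
          if (assign[i] == j) {
            sum += data[i];
            count++;
          }
        }
        if (count > 0) {
          means[j] = sum / count;
        }
      }
    }
    return assign;
  }

  public static void main(String[] args) {
    double[] data = {1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.0, 9.1, 8.7}; // made-up sample data
    int[] first = cluster(data, 3, 0L);
    int[] second = cluster(data, 3, 0L);
    System.out.println(Arrays.equals(first, second)); // prints true
  }
}
```

Note that a fixed seed makes the result reproducible, not necessarily better: it simply repeats one particular random initialization.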