I'm getting different results every time I run my code

Question

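I am using ELKI to cluster my data. I use KMeansLloyd<NumberVector> with k=3. Every time I run my Java code I get completely different clustering results. Is this normal, or should I do something to make my output close to stable? Here is the code, which I took from the ELKI tutorial: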

    DatabaseConnection dbc = new ArrayAdapterDatabaseConnection(a);
    // Create a database (which may contain multiple relations!)
    Database db = new StaticArrayDatabase(dbc, null);
    // Load the data into the database (do NOT forget to initialize...)
    db.initialize();
    // Relation containing the number vectors:
    Relation<NumberVector> rel = db.getRelation(TypeUtil.NUMBER_VECTOR_FIELD);
    // We know that the ids must be a continuous range:
    DBIDRange ids = (DBIDRange) rel.getDBIDs();

    // K-means should be used with squared Euclidean (least squares):
    //SquaredEuclideanDistanceFunction dist = SquaredEuclideanDistanceFunction.STATIC;
    CosineDistanceFunction dist= CosineDistanceFunction.STATIC;

    // Default initialization, using global random:
    // To fix the random seed, use: new RandomFactory(seed);
    RandomlyGeneratedInitialMeans init = new RandomlyGeneratedInitialMeans(RandomFactory.DEFAULT);

    // Textbook k-means clustering:
    KMeansLloyd<NumberVector> km = new KMeansLloyd<>(dist, //
    3 /* k - number of partitions */, //
    0 /* maximum number of iterations: no limit */, init);

    // K-means will automatically choose a numerical relation from the data set:
    // But we could make it explicit (if there were more than one numeric
    // relation!): km.run(db, rel);
    Clustering<KMeansModel> c = km.run(db);

    // Output all clusters:
    int i = 0;
    for(Cluster<KMeansModel> clu : c.getAllClusters()) {
      // K-means will name all clusters "Cluster" in lack of noise support:
      System.out.println("#" + i + ": " + clu.getNameAutomatic());
      System.out.println("Size: " + clu.size());
      System.out.println("Center: " + clu.getModel().getPrototype().toString());
      // Iterate over objects:
      System.out.print("Objects: ");

      for(DBIDIter it = clu.getIDs().iter(); it.valid(); it.advance()) {
        // To get the vector use:
         NumberVector v = rel.get(it);

        // Offset within our DBID range: "line number"
        final int offset = ids.getOffset(it);
        System.out.print(v+" " + offset);
        // Do NOT rely on using "internalGetIndex()" directly!
      }
      System.out.println();
      ++i;
    }

Answer 1

I would say so, since you are using RandomlyGeneratedInitialMeans:

Initialize k-means by generating random vectors (within the data sets value range).

RandomlyGeneratedInitialMeans init = new RandomlyGeneratedInitialMeans(RandomFactory.DEFAULT);

So yes, this is normal.
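
To see why the starting means matter, here is a minimal sketch in plain Java (no ELKI). It assumes one-dimensional data, absolute distance, and two hand-picked sets of starting means instead of randomly generated ones; the class and method names are made up for illustration. Lloyd's iterations converge from both starts, but to different partitions:

```java
import java.util.Arrays;

final class InitSensitivity {
  /** Runs Lloyd's k-means on 1-D data from the given starting means
   *  and returns the cluster index assigned to each point. */
  static int[] lloyd(double[] data, double[] startMeans) {
    double[] means = startMeans.clone();
    int[] assign = new int[data.length];
    Arrays.fill(assign, -1);
    boolean changed = true;
    while (changed) {
      changed = false;
      // Assignment step: attach every point to its nearest mean.
      for (int i = 0; i < data.length; i++) {
        int best = 0;
        for (int j = 1; j < means.length; j++) {
          if (Math.abs(data[i] - means[j]) < Math.abs(data[i] - means[best])) {
            best = j;
          }
        }
        if (assign[i] != best) {
          assign[i] = best;
          changed = true;
        }
      }
      // Update step: move every mean to the average of its points.
      for (int j = 0; j < means.length; j++) {
        double sum = 0;
        int n = 0;
        for (int i = 0; i < data.length; i++) {
          if (assign[i] == j) {
            sum += data[i];
            n++;
          }
        }
        if (n > 0) {
          means[j] = sum / n; // an empty cluster keeps its previous mean
        }
      }
    }
    return assign;
  }

  public static void main(String[] args) {
    double[] data = {1, 2, 5, 6, 20, 21};
    // prints [0, 0, 1, 1, 2, 2]
    System.out.println(Arrays.toString(lloyd(data, new double[] {1, 5, 20})));
    // prints [0, 0, 0, 0, 1, 2]
    System.out.println(Arrays.toString(lloyd(data, new double[] {1, 20, 21})));
  }
}
```

Both results are stable under further iterations, so neither run is "broken"; the second start simply ends in a worse local optimum.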

Answer 2

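This is too long for a comment. As @Idos said, the initialization is random; that is why you get random results. The question now is how to make the results robust. Try this:

Run the algorithm N times. Each time, record the cluster membership of every observation. When you are done, assign each observation to the cluster that contained it most often. For example, suppose you have 3 observations, 3 classes, and run the algorithm 3 times: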

obs R1  R2  R3
1   A   A   B
2   B   B   B
3   C   B   B

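Then you should classify obs1 as A, because it was most often classified as A. Classify obs2 as B, because it was always classified as B. And classify obs3 as B, because the algorithm most often classified it as B. The more times you run the algorithm, the more stable the result should become.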
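
The vote can be sketched in a few lines of plain Java. This assumes the labels A, B, C already mean the same cluster in every run; k-means numbers its clusters arbitrarily, so in practice the labels would have to be matched across runs first, which is not shown here. The class and method names are made up for illustration, and ties go to whichever label reaches the top count first.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

final class ConsensusVote {
  /** labels[r][i] is the label of observation i in run r.
   *  Returns, for every observation, the label it received most often. */
  static String[] majority(String[][] labels) {
    int n = labels[0].length;
    String[] result = new String[n];
    for (int i = 0; i < n; i++) {
      Map<String, Integer> counts = new HashMap<>();
      String best = null;
      int bestCount = 0;
      for (String[] run : labels) {
        // Count this run's vote and remember the current leader.
        int c = counts.merge(run[i], 1, Integer::sum);
        if (c > bestCount) {
          bestCount = c;
          best = run[i];
        }
      }
      result[i] = best;
    }
    return result;
  }

  public static void main(String[] args) {
    // The three runs R1, R2, R3 from the table above (one row per run).
    String[][] runs = {
      {"A", "B", "C"},
      {"A", "B", "B"},
      {"B", "B", "B"},
    };
    // prints [A, B, B]
    System.out.println(Arrays.toString(majority(runs)));
  }
}
```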

Answer 3

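K-means is supposed to be initialized randomly. Getting different results across multiple runs is the desired behavior.

If you don't want that, use a fixed random seed.

From the code you copied and pasted: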

// To fix the random seed, use: new RandomFactory(seed);

That is exactly what you should do...

long seed = 0;
RandomlyGeneratedInitialMeans init = new RandomlyGeneratedInitialMeans(
  new RandomFactory(seed));
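
The effect of fixing the seed can be illustrated without ELKI. The sketch below uses java.util.Random as a stand-in for ELKI's RandomFactory (an assumption made purely for illustration, with made-up class and method names) and draws k starting means within a value range. The same seed gives the same starting means on every run, so a deterministic algorithm started from them gives the same clustering.

```java
import java.util.Arrays;
import java.util.Random;

final class SeededInit {
  /** Draws k initial 1-D means uniformly from [min, max), driven by the given seed. */
  static double[] randomMeans(int k, double min, double max, long seed) {
    Random rnd = new Random(seed); // same seed, same sequence of numbers
    double[] means = new double[k];
    for (int j = 0; j < k; j++) {
      means[j] = min + rnd.nextDouble() * (max - min);
    }
    return means;
  }

  public static void main(String[] args) {
    double[] first = randomMeans(3, 0.0, 10.0, 0L);
    double[] second = randomMeans(3, 0.0, 10.0, 0L);
    System.out.println(Arrays.toString(first));
    System.out.println(Arrays.toString(second));
    // prints true: both calls used seed 0
    System.out.println(Arrays.equals(first, second));
  }
}
```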
