How would I implement k-means with TensorFlow?
The intro tutorial, which uses the built-in gradient descent optimizer, makes sense. However, k-means isn't just something I can plug into gradient descent. It seems like I'd have to write my own optimizer, but I'm not quite sure how to do that given the TensorFlow primitives.
What approach should I take?
(Note: You can now get a more polished version of this code as a gist on GitHub.)
You can certainly do it, but you need to define your own optimization criteria (for k-means, it's usually a maximum iteration count and when the assignments stabilize). Here's an example of how you could do it (there are probably more optimal ways to implement it, and definitely better ways to select the initial points). It's basically as though you were doing it in numpy if you were trying really hard to stay away from doing things iteratively in Python:
import tensorflow as tf
import numpy as np
import time

N = 10000
K = 4
MAX_ITERS = 1000

start = time.time()

points = tf.Variable(tf.random_uniform([N, 2]))
cluster_assignments = tf.Variable(tf.zeros([N], dtype=tf.int64))

# Silly initialization: Use the first K points as the starting
# centroids. In the real world, do this better.
centroids = tf.Variable(tf.slice(points.initialized_value(), [0, 0], [K, 2]))

# Replicate to N copies of each centroid and K copies of each
# point, then subtract and compute the sum of squared distances.
rep_centroids = tf.reshape(tf.tile(centroids, [N, 1]), [N, K, 2])
rep_points = tf.reshape(tf.tile(points, [1, K]), [N, K, 2])
sum_squares = tf.reduce_sum(tf.square(rep_points - rep_centroids),
                            reduction_indices=2)

# Use argmin to select the lowest-distance centroid for each point
best_centroids = tf.argmin(sum_squares, 1)

did_assignments_change = tf.reduce_any(tf.not_equal(best_centroids,
                                                    cluster_assignments))

def bucket_mean(data, bucket_ids, num_buckets):
    total = tf.unsorted_segment_sum(data, bucket_ids, num_buckets)
    count = tf.unsorted_segment_sum(tf.ones_like(data), bucket_ids, num_buckets)
    return total / count

means = bucket_mean(points, best_centroids, K)

# Do not write to the assigned clusters variable until after
# computing whether the assignments have changed - hence with_dependencies
with tf.control_dependencies([did_assignments_change]):
    do_updates = tf.group(
        centroids.assign(means),
        cluster_assignments.assign(best_centroids))

sess = tf.Session()
sess.run(tf.initialize_all_variables())

changed = True
iters = 0

while changed and iters < MAX_ITERS:
    iters += 1
    [changed, _] = sess.run([did_assignments_change, do_updates])

[centers, assignments] = sess.run([centroids, cluster_assignments])
end = time.time()
print("Found in %.2f seconds" % (end - start), iters, "iterations")
print("Centroids:")
print(centers)
print("Cluster assignments:", assignments)
(Note that a real implementation would need to be much more careful about initial cluster selection, avoiding problem cases where all points go to one cluster, and so on. This was just a quick demo. I've updated my earlier answer to make it a bit clearer and more "example-worthy".)
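If you want better initial centroids than "the first K points", one common option is k-means++-style seeding. Here is a minimal numpy sketch of that idea (the helper name kmeanspp_init and the points_np placeholder are my own, not part of the code above):
import numpy as np

def kmeanspp_init(points_np, k, rng=np.random):
    # pick the first centroid uniformly at random, then sample each new
    # centroid with probability proportional to its squared distance
    # from the nearest centroid chosen so far
    n = points_np.shape[0]
    centroids = [points_np[rng.randint(n)]]
    for _ in range(k - 1):
        d2 = np.min(
            ((points_np[:, None, :] - np.array(centroids)[None, :, :]) ** 2).sum(-1),
            axis=1)
        centroids.append(points_np[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centroids)

# e.g. feed this into the graph above instead of the tf.slice initialization:
# centroids = tf.Variable(kmeanspp_init(points_np, K).astype(np.float32))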
Most of the answers I've seen so far focus only on the 2D version (when you need to cluster points in two dimensions). Here is my implementation of clustering in an arbitrary number of dimensions.
The basic idea of the n-dimensional k-means algorithm (a plain numpy sketch of this loop follows the list):
- Generate k random starting points
- Repeat until you run out of patience or the cluster assignments stop changing:
  - Assign each point to the closest starting point
  - Recalculate the position of each starting point as the mean of the points in its cluster
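Before translating this into a TensorFlow graph, the same loop can be written as a compact numpy sketch (this is only an illustration of the steps above, with my own helper name kmeans_np; it is not the code used below):
import numpy as np

def kmeans_np(X, k, max_iters=300, rng=np.random):
    # generate k random starting points
    centroids = X[rng.choice(X.shape[0], size=k, replace=False)].copy()
    assignments = np.full(X.shape[0], -1, dtype=np.int64)
    for _ in range(max_iters):
        # assign each point to the closest centroid
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        new_assignments = d2.argmin(axis=1)
        if np.array_equal(new_assignments, assignments):
            break  # assignments stopped changing
        assignments = new_assignments
        # recompute each centroid as the mean of the points assigned to it
        for j in range(k):
            if np.any(assignments == j):
                centroids[j] = X[assignments == j].mean(axis=0)
    return centroids, assignments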
To be able to validate the results somehow, I'll try to cluster the MNIST images.
import numpy as np
import tensorflow as tf
from random import randint
from collections import Counter
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/")
X, y, k = mnist.test.images, mnist.test.labels, 10
So here X is my data to cluster, with shape (10000, 784), y is the actual digit labels, and k is the number of clusters (which is the same as the number of digits). Now the actual algorithm:
# select random points as a starting position. You can do better with a smarter initialization.
start_pos = tf.Variable(X[np.random.randint(X.shape[0], size=k), :], dtype=tf.float32)
centroids = tf.Variable(start_pos.initialized_value(), 'S', dtype=tf.float32)

# populate points
points = tf.Variable(X, 'X', dtype=tf.float32)
ones_like = tf.ones((points.get_shape()[0], 1))
prev_assignments = tf.Variable(tf.zeros((points.get_shape()[0],), dtype=tf.int64))

# find the distance between all points and all centroids:
# ||x - c||^2 = ||x||^2 + ||c||^2 - 2*x.c, computed for every pair at once
p1 = tf.matmul(
    tf.expand_dims(tf.reduce_sum(tf.square(points), 1), 1),
    tf.ones(shape=(1, k))
)
p2 = tf.transpose(tf.matmul(
    tf.reshape(tf.reduce_sum(tf.square(centroids), 1), shape=[-1, 1]),
    ones_like,
    transpose_b=True
))
distance = tf.sqrt(tf.add(p1, p2) - 2 * tf.matmul(points, centroids, transpose_b=True))

# assign each point to the closest centroid
point_to_centroid_assignment = tf.argmin(distance, axis=1)

# recalculate the centers
total = tf.unsorted_segment_sum(points, point_to_centroid_assignment, k)
count = tf.unsorted_segment_sum(ones_like, point_to_centroid_assignment, k)
means = total / count

# continue if there is any difference between the current and previous assignment
is_continue = tf.reduce_any(tf.not_equal(point_to_centroid_assignment, prev_assignments))

with tf.control_dependencies([is_continue]):
    loop = tf.group(centroids.assign(means), prev_assignments.assign(point_to_centroid_assignment))

sess = tf.Session()
sess.run(tf.global_variables_initializer())

# do many iterations. Hopefully you will stop because has_changed becomes False
has_changed, cnt = True, 0
while has_changed and cnt < 300:
    cnt += 1
    has_changed, _ = sess.run([is_continue, loop])

# see how the data is assigned
res = sess.run(point_to_centroid_assignment)
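The distance computation above relies on the expansion ||x - c||^2 = ||x||^2 + ||c||^2 - 2*x.c, evaluated for every point/centroid pair at once via matrix products (that is what p1 + p2 - 2*matmul amounts to). A quick numpy check of that identity on random stand-in data (not the MNIST tensors above):
import numpy as np

X = np.random.rand(100, 784).astype(np.float32)   # stand-in for points
C = np.random.rand(10, 784).astype(np.float32)    # stand-in for centroids

# direct pairwise squared distances, shape (100, 10)
direct = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)

# expanded form, the same thing p1 + p2 - 2 * matmul computes
expanded = (X ** 2).sum(1)[:, None] + (C ** 2).sum(1)[None, :] - 2 * X @ C.T

print(np.allclose(direct, expanded, atol=1e-2))   # True, up to float32 rounding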
Now it's time to check how good our clusters are. To do that, we group together all the real digit labels that appear in a cluster, and then look at the most popular choices in that cluster. In the case of perfect clustering we would have only one value in each group; in the case of random clustering, each value would be roughly equally represented in every group.
nums_in_clusters = [[] for i in range(10)]
for cluster, real_num in zip(list(res), list(y)):
    nums_in_clusters[cluster].append(real_num)

for i in range(10):
    print(Counter(nums_in_clusters[i]).most_common(3))
This gives me something like this:
[(0, 738), (6, 18), (2, 11)]
[(1, 641), (3, 53), (2, 51)]
[(1, 488), (2, 115), (7, 56)]
[(4, 550), (9, 533), (7, 280)]
[(7, 634), (9, 400), (4, 302)]
[(6, 649), (4, 27), (0, 14)]
[(5, 269), (6, 244), (0, 161)]
[(8, 646), (5, 164), (3, 125)]
[(2, 698), (3, 34), (7, 14)]
[(3, 712), (5, 290), (8, 110)]
This is pretty good, because the majority of the counts are in the first element. You can see that the clustering confuses 7 and 9, and 4 and 5, but 0 is clustered pretty nicely.
A couple of ways to improve this:
- Run the algorithm a few times and select the best result (based on the distance to the clusters)
- Handle cases where nothing gets assigned to a cluster. In my case you will get NaN in the means variable because count is 0 (one possible guard is sketched right after this list).
- Random points initialization.
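For the second point, one possible guard (my own addition, relying on TF 1.x tf.where row-selection semantics) is to fall back to the previous centroid whenever a cluster ends up empty:
# guard against empty clusters: if a cluster's count is 0, keep its old centroid
count_vec = tf.reshape(count, [-1])           # shape (k,)
safe_means = total / tf.maximum(count, 1.0)   # avoid 0/0 -> NaN
means = tf.where(tf.equal(count_vec, 0), centroids, safe_means)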
Nowadays you can directly use (or take inspiration from) the KMeansClustering Estimator. You can take a look at its implementation on GitHub.
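A minimal usage sketch, following the pattern from the TF 1.x tf.contrib.factorization.KMeansClustering documentation (the toy data here is made up for illustration; check the linked implementation for the exact API of your TensorFlow version):
import numpy as np
import tensorflow as tf

points = np.random.uniform(0, 10, [1000, 2]).astype(np.float32)  # toy data

def input_fn():
    return tf.train.limit_epochs(
        tf.convert_to_tensor(points, dtype=tf.float32), num_epochs=1)

kmeans = tf.contrib.factorization.KMeansClustering(
    num_clusters=4, use_mini_batch=False)

for _ in range(10):                  # a few passes of Lloyd's algorithm
    kmeans.train(input_fn)

centers = kmeans.cluster_centers()
assignments = list(kmeans.predict_cluster_index(input_fn))
print(centers)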