SVM on MNIST data with PCA using tensorflow

I wanted to understand PCA by way of SVD, so I implemented it and tried it out on the MNIST data.

import numpy as np

class PCA(object):

    def __init__(self, X):

        self.N, self.dim, *rest = X.shape
        self.X = X

        # Standardize the data, then factor it: U S V' = svd(X_std)
        X_std = (X - np.mean(X, axis=0)) / (np.std(X, axis=0) + 1e-13)

        # Economy-size SVD avoids forming the full N x N matrix U
        [self.U, self.s, self.Vt] = np.linalg.svd(X_std, full_matrices=False)
        self.V = self.Vt.T
        # Variance along each principal component is proportional to
        # the squared singular value
        self.variance_ratio = self.s ** 2

    def variance_explained_ratio(self):
        '''
        Returns the cumulative variance captured with each added principal component
        '''
        return np.cumsum(self.variance_ratio) / np.sum(self.variance_ratio)

    def X_projected(self, r):
        '''
        Returns the data X projected along the first r principal components
        '''
        if r is None:
            r = self.dim
        P_reduce = self.V[:, 0:r]
        X_proj = self.X.dot(P_reduce)
        return X_proj
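
As an aside, variance_explained_ratio gives a natural way to choose r. A minimal sketch on synthetic data (the shapes and the 95% threshold here are just for illustration, not from the MNIST run):

# Hypothetical usage: pick the smallest r that keeps 95% of the variance
X = np.random.randn(1000, 50)
pca = PCA(X)
cumulative = pca.variance_explained_ratio()
r = int(np.searchsorted(cumulative, 0.95)) + 1
X_r = pca.X_projected(r)
print(r, X_r.shape)   # X_r has shape (1000, r)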

Now, with this PCA implementation in hand, I tried applying it to the MNIST data to compare classification performance with softmax, with and without PCA. The code is below:

import time
import random
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

# Using first 10000 images 
train_data = mnist.train.images[:10000,:]
train_labels = mnist.train.labels[:10000,:]
pca1 = PCA(train_data)
pca_test = PCA(mnist.test.images)

n_components = 14
X_proj1 = pca1.X_projected(r=n_components)
X_projTest = pca_test.X_projected(r=n_components)

t1 = time.time()

x = tf.placeholder(tf.float32, [None, n_components])
W = tf.Variable(tf.zeros([n_components, 10]))
b = tf.Variable(tf.zeros([10]))


y = tf.cast(tf.nn.softmax(tf.matmul(x, W) + b), tf.float32)
y_ = tf.placeholder(tf.float32, [None, 10])
cross_entropy = tf.reduce_mean(
    -tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))

train_step = tf.train.GradientDescentOptimizer(0.7).minimize(cross_entropy)
sess = tf.InteractiveSession()
tf.global_variables_initializer().run()

m = 10000

for _ in range(1000):
    indices = random.sample(range(0, m), 100)
    batch_xs = X_proj1[indices]
    batch_ys = train_labels[indices]
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))


accuracy = sess.run(accuracy, feed_dict={x: X_projTest, y_: mnist.test.labels})
print("Accuracy: %f" % accuracy)
sess.close()
t2 = time.time()
print ("Total time taken: %f seconds" % (t2-t1))

The accuracy I get with this is only around 19%, whereas training on train_data and train_labels directly gives an accuracy above 90%. Can someone suggest where I am going wrong?

When we use PCA (or feature scaling), we fit the underlying parameters on the training dataset and then apply/transform them to the test dataset. The test dataset is never used to compute those key parameters; here, that means the SVD should be applied only to the training dataset. For example, with sklearn's PCA we would write:

from sklearn.decomposition import PCA
pca = PCA(n_components=14)   # or however many components you want
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

Note that we fit on the training dataset X_train and only transform X_test.

Likewise, in the implementation above there is no need to create the pca_test object at all. Change the X_projTest variable to:

X_projTest = mnist.test.images.dot(pca1.V[:,0:n_components])

This should fix the problem of low test accuracy.
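
Equivalently, you could give the PCA class a transform method that mirrors sklearn's convention (a small sketch; the transform name and the attachment below are not part of the original class):

def transform(self, X_new, r):
    '''
    Project new data onto the first r principal components
    learned from the training data (mirrors sklearn's transform).
    '''
    return X_new.dot(self.V[:, 0:r])

PCA.transform = transform   # attach to the class defined above

X_projTest = pca1.transform(mnist.test.images, n_components)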