scikit-learn PCA transform returns incorrect reduced feature length
I am trying to apply PCA in my code. I train on my data with the following code:
def gather_train():
    train_data = np.array([])
    train_labels = np.array([])
    with open(training_info, "r") as traincsv:
        for line in traincsv:
            current_image = "train\{}".format(line.strip().split(",")[0])
            print "Reading data from: {}".format(current_image)
            train_labels = np.append(train_labels, int(line.strip().split(",")[1]))
            with open(current_image, "rb") as img:
                train_data = np.append(train_data, np.fromfile(img, dtype=np.uint8).reshape(-1, 784)/255.0)
    train_data = train_data.reshape(len(train_labels), 784)
    return train_data, train_labels

def get_PCA_train(data):
    print "\nFitting PCA. Components: {} ...".format(PCA_components)
    pca = decomposition.PCA(n_components=PCA_components).fit(data)
    print "\nReducing data to {} components ...".format(PCA_components)
    data_reduced = pca.fit_transform(data)
    return data_reduced

def get_PCA_test(data):
    print "\nFitting PCA. Components: {} ...".format(PCA_components)
    pca = decomposition.PCA(n_components=PCA_components).fit(data)
    print "\nReducing data to {} components ...".format(PCA_components)
    data_reduced = pca.transform(data)
    return data_reduced

def gather_test(imgfile):
    # input is a single file; reads its data directly, unlike gather_train, which gathers all images at once
    with open(imgfile, "rb") as img:
        return np.fromfile(img, dtype=np.uint8).reshape(-1, 784)/255.0
...
train_data, train_labels = gather_train()
train_data_reduced = get_PCA_train(train_data)
print train_data.ndim, train_data.shape
print train_data_reduced.ndim, train_data_reduced.shape
This prints the following, as expected:
2 (1000L, 784L)
2 (1000L, 300L)
But when I go on to reduce my test data:
test_data = gather_test(image_file)
# image_file is 784 bytes (28x28) of pixel values; 1 byte = 1 pixel value
test_data_reduced = get_PCA_test(test_data)
print test_data.ndim, test_data.shape
print test_data_reduced.ndim, test_data_reduced.shape
the output is:
2 (1L, 784L)
2 (1L, 1L)
which later causes the error:
ValueError: X.shape[1] = 1 should be equal to 300, the number of features at training time
Why is the shape of test_data_reduced (1, 1) instead of (1, 300)? I have already tried using fit_transform on the training data and transform only on the test data, but I still get the same error.
The calls to PCA should look roughly like this:
pca = decomposition.PCA(n_components=PCA_components).fit(train_data)
data_reduced = pca.transform(test_data)
You first call fit on the training data, and then call transform on the test data you want to reduce. In your get_PCA_test function you instead fit a brand-new PCA on the single test sample; with only one sample, PCA cannot produce more than one component, which is why the result has shape (1, 1) rather than (1, 300).
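
As a minimal sketch of how the question's helpers could be restructured so that a single PCA, fitted on the training data, is reused for the test data (the function names fit_PCA and reduce_with_PCA are placeholders; PCA_components, train_data and test_data are assumed to be defined as in the question):

from sklearn import decomposition

def fit_PCA(train_data, n_components):
    # Fit PCA once, on the training data only, and keep the fitted object around.
    pca = decomposition.PCA(n_components=n_components)
    train_reduced = pca.fit_transform(train_data)
    return pca, train_reduced

def reduce_with_PCA(pca, data):
    # Reuse the already-fitted PCA; never fit again on the test data.
    return pca.transform(data)

pca, train_data_reduced = fit_PCA(train_data, PCA_components)
test_data_reduced = reduce_with_PCA(pca, test_data)

With 1000 training samples and PCA_components = 300, the last call returns an array of shape (1, 300) for a single 784-pixel test image, which matches the 300 features expected at training time.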