How to display the similarity between classes?
I have handwriting samples from two writers. I am using a feature extractor to extract features from both.
I want to display the similarity between the classes, that is, show how alike the two are and how difficult it would be for a classifier to classify them correctly.
I have read papers that use PCA to demonstrate this. I tried using PCA, but I don't think I implemented it correctly. I used it to display the similarity.
[COEFF,SCORE] = princomp(features_extracted);
plot(COEFF,'.')
But I get exactly the same plot for every class and every sample. I mean they should be similar, not completely identical. What am I doing wrong?
With only 10 samples per class and over 4000 features, you will struggle to show anything meaningful.
Nevertheless, the following code will compute the PCA and show the relationship between the first two principal components (the components containing 'most' of the variance).
% Truly indistinguishable data
dummy_data = randn(20, 4000);
% Uncomment this to make the data distinguishable
%dummy_data(1:10, :) = dummy_data(1:10, :) - 0.5;
% Normalise the data - this isn't technically required for the dummy data
% above, but is included for completeness.
dummy_data_normalised = dummy_data;
for f = 1:size(dummy_data_normalised, 2)
dummy_data_normalised(:, f) = dummy_data_normalised(:, f) - nanmean(dummy_data_normalised(:, f));
dummy_data_normalised(:, f) = dummy_data_normalised(:, f) / nanstd(dummy_data_normalised(:, f));
end
% Generate vector of 10 0's and 10 1's
class_labels = reshape(repmat([0 1], 10, 1), 20, 1);
% Perform PCA
pca_coeffs = pca(dummy_data_normalised);
% Calculate transformed data
dummy_data_pca = dummy_data_normalised * pca_coeffs;
figure;
hold on;
for class = unique(class_labels)'
% Plot the first two components for this class
scatter(dummy_data_pca(class_labels == class, 1), dummy_data_pca(class_labels == class, 2), 'filled')
end
legend(strcat({'Class '},int2str(unique(class_labels)))')
For the indistinguishable data this will show a scatter plot similar to the following:
It is obviously not possible to draw a separating line between the two classes.
If you uncomment the commented-out line that shifts the first ten rows (making the data distinguishable), the plot will instead look like this:
However, to repeat what I wrote in the comments: PCA does not necessarily find the components that provide the best separation. It is an unsupervised method and only finds the components with the largest variance. In some applications these also happen to be the components that give good separation, but that is not guaranteed. And with only 10 samples per class you will not be able to demonstrate anything statistically significant. Also have a look at this question for more details about PCA and the number of samples per class.
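To make that caveat concrete, here is a minimal sketch (my own dummy-data illustration, not part of the original answer) of a case where the class difference lies along a low-variance feature, so the first principal component misses it entirely:
% Two classes that differ only along a LOW-variance feature
rng(0);
n = 100;
% Feature 1: large variance, no class information
shared_feature = 10 * randn(2 * n, 1);
% Feature 2: small variance, but shifted between the two classes
informative_feature = 0.5 * randn(2 * n, 1) + [zeros(n, 1); 1.5 * ones(n, 1)];
% Deliberately NOT standardised, so the raw variance difference drives the PCA
X = [shared_feature informative_feature];
labels = [zeros(n, 1); ones(n, 1)];
% The first component aligns with the high-variance (uninformative) feature;
% the class shift only shows up in the second, low-variance component
[~, score] = pca(X);
figure;
hold on;
for c = unique(labels)'
    scatter(score(labels == c, 1), score(labels == c, 2), 'filled')
end
xlabel('PC 1 (largest variance, classes overlap)');
ylabel('PC 2 (small variance, classes separate)');
legend('Class 0', 'Class 1');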
Edit: This also extends naturally to having more classes:
numer_of_classes = 10;
samples_per_class = 20;
% Truly indistinguishable data
dummy_data = randn(numer_of_classes * samples_per_class, 4000);
% Make the data distinguishable
for i = 1:numer_of_classes
idx = (((i - 1) * samples_per_class) + 1):(i * samples_per_class);
dummy_data(idx, :) = dummy_data(idx, :) - (0.5 * (i - 1));
end
% Normalise the data
dummy_data_normalised = dummy_data;
for f = 1:size(dummy_data_normalised, 2)
dummy_data_normalised(:, f) = dummy_data_normalised(:, f) - nanmean(dummy_data_normalised(:, f));
dummy_data_normalised(:, f) = dummy_data_normalised(:, f) / nanstd(dummy_data_normalised(:, f));
end
% Generate vector of classes (1 to numer_of_classes)
class_labels = reshape(repmat(1:numer_of_classes, samples_per_class, 1), numer_of_classes * samples_per_class, 1);
% Perform PCA
pca_coeffs = pca(dummy_data_normalised);
% Calculate transformed data
dummy_data_pca = dummy_data_normalised * pca_coeffs;
figure;
hold on;
for class = unique(class_labels)'
% Plot the first two components for this class
scatter(dummy_data_pca(class_labels == class, 1), dummy_data_pca(class_labels == class, 2), 'filled')
end
legend(strcat({'Class '},int2str(unique(class_labels)))')
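One small addition that is not in the original answer: pca can also report how much of the total variance each component explains, which gives a quick check of how representative the two plotted components really are:
% The fifth output of pca is the percentage of variance explained per component
[~, ~, ~, ~, explained] = pca(dummy_data_normalised);
fprintf('First two components explain %.1f%% of the total variance\n', ...
    sum(explained(1:2)));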