How to display the similarity between classes?

I have handwriting samples from two writers. I am using a feature extractor to extract features from both.

I want to show the similarity between the classes: how alike the two are, and how hard it is for a classifier to classify them correctly.

I have read papers that use PCA to demonstrate this. I tried using PCA to show the similarity, but I don't think I implemented it correctly.

[COEFF,SCORE] = princomp(features_extracted);
plot(COEFF,'.')

However, I get exactly the same plot for every class and every sample. I would expect them to be similar, not identical. What am I doing wrong?

With only 10 samples per class and more than 4000 features, you will struggle to show anything significant.
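One way to see the limitation: after centring, n samples span at most n - 1 directions, so PCA cannot return more than n - 1 components with non-zero variance, however many features there are. A minimal sketch, assuming the Statistics Toolbox `pca` function with its default options (the names `n_samples`, `n_features`, `X_small` and `coeffs_small` are made up for illustration):

```matlab
% Hypothetical sizes matching the question: 2 classes x 10 samples, 4000 features
n_samples = 20;
n_features = 4000;
X_small = randn(n_samples, n_features);

% With default options pca centres the data and returns the economy-size result
coeffs_small = pca(X_small);

% The number of columns is the number of components PCA could actually find
disp(size(coeffs_small))
```

Everything the scatter plot can show therefore lives in a subspace of very low dimension compared with the 4000 original features.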

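Nevertheless, the following code computes the PCA and plots the first two principal components (the ones that capture the most variance) against each other.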

% Truly indistinguishable data
dummy_data = randn(20, 4000);

% Uncomment this to make the data distinguishable
%dummy_data(1:10, :) = dummy_data(1:10, :) - 0.5;

% Normalise the data - this isn't technically required for the dummy data
% above, but is included for completeness.
dummy_data_normalised = dummy_data;
for f = 1:size(dummy_data, 2)
    dummy_data_normalised(:, f) = dummy_data_normalised(:, f) - nanmean(dummy_data_normalised(:, f));
    dummy_data_normalised(:, f) = dummy_data_normalised(:, f) / nanstd(dummy_data_normalised(:, f));
end

% Generate vector of 10 0's and 10 1's
class_labels = reshape(repmat([0 1], 10, 1), 20, 1);

% Perform PCA
pca_coeffs = pca(dummy_data_normalised);

% Calculate transformed data
dummy_data_pca = dummy_data_normalised * pca_coeffs;

figure;
hold on;

for class = unique(class_labels)'
    % Plot the first two components for the current class
    scatter(dummy_data_pca(class_labels == class, 1), dummy_data_pca(class_labels == class, 2), 'filled')
end

legend(strcat({'Class '},int2str(unique(class_labels)))')

For indistinguishable data, this produces a scatter plot similar to the following:

It is clearly impossible to draw a separating line between the two classes.

If you uncomment the fifth line to make the data distinguishable, the plot changes to the following:

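However, to repeat what I wrote in the comments, PCA does not necessarily find the components that give the best separation. It is an unsupervised method and only finds the components with the largest variance. In some applications these also happen to be the components that separate the classes well. With only 10 samples per class, you will not be able to demonstrate anything statistically significant. Also see this question for more details on PCA and the number of samples per class.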
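To illustrate the point about variance versus separation, here is a toy sketch (not the handwriting data; all names and numbers are invented for the example). The first feature has a large variance but is identical for both classes, while the second has a small variance but different class means. The data is deliberately left unnormalised so that the variance difference survives.

```matlab
rng(0);   % fixed seed so the sketch is repeatable
n = 200;  % samples per class (hypothetical)

% Column 1: high variance, no class information
% Column 2: low variance, class means at -2 and +2
class_a = [10 * randn(n, 1), 0.5 * randn(n, 1) - 2];
class_b = [10 * randn(n, 1), 0.5 * randn(n, 1) + 2];
toy_data = [class_a; class_b];
toy_labels = [zeros(n, 1); ones(n, 1)];

[toy_coeffs, toy_scores] = pca(toy_data);

% Distance between the class means along each component,
% measured in units of the pooled within-class standard deviation
separation = zeros(1, 2);
for k = 1:2
    a = toy_scores(toy_labels == 0, k);
    b = toy_scores(toy_labels == 1, k);
    separation(k) = abs(mean(a) - mean(b)) / sqrt((var(a) + var(b)) / 2);
end

disp(separation)
```

In this construction the second component, not the first, should carry almost all of the class separation. A plot of only the first component would suggest the classes are indistinguishable even though they are easy to separate.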

Edit: This also extends naturally to more classes:

numer_of_classes = 10;
samples_per_class = 20;

% Truly indistinguishable data
dummy_data = randn(numer_of_classes * samples_per_class, 4000);

% Make the data distinguishable
for i = 1:numer_of_classes
    ixd = (((i - 1) * samples_per_class) + 1):(i * samples_per_class);
    dummy_data(ixd, :) = dummy_data(ixd, :) - (0.5 * (i - 1));
end

% Normalise the data
dummy_data_normalised = dummy_data;
for f = 1:size(dummy_data, 2)
    dummy_data_normalised(:, f) = dummy_data_normalised(:, f) - nanmean(dummy_data_normalised(:, f));
    dummy_data_normalised(:, f) = dummy_data_normalised(:, f) / nanstd(dummy_data_normalised(:, f));
end

% Generate vector of classes (1 to numer_of_classes)
class_labels = reshape(repmat(1:numer_of_classes, samples_per_class, 1), numer_of_classes * samples_per_class, 1);

% Perform PCA
pca_coeffs = pca(dummy_data_normalised);

% Calculate transformed data
dummy_data_pca = dummy_data_normalised * pca_coeffs;

figure;
hold on;

for class = unique(class_labels)'
    % Plot the first two components for the current class
    scatter(dummy_data_pca(class_labels == class, 1), dummy_data_pca(class_labels == class, 2), 'filled')
end

legend(strcat({'Class '},int2str(unique(class_labels)))')
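As a side note, `pca` can also return the transformed data and the percentage of variance explained by each component directly, which helps judge how much of the data a two-component scatter plot really represents. A minimal sketch, assuming the Statistics Toolbox functions `pca` and `zscore`; the variable names are made up for illustration:

```matlab
rng(1);                       % fixed seed so the sketch is repeatable
X = zscore(randn(20, 4000));  % column-wise normalisation, as in the loops above

% Second output: data projected onto the components
% Fifth output: percentage of total variance explained by each component
[coeffs, scores, ~, ~, explained] = pca(X);

% Because X is already centred, the manual projection matches the scores
manual_scores = X * coeffs;
max_diff = max(abs(scores(:) - manual_scores(:)));

% Share of the total variance visible in a plot of the first two components
variance_in_plot = sum(explained(1:2));

fprintf('Largest difference between scores and X * coeffs: %g\n', max_diff);
fprintf('Variance shown by the first two components: %.1f%%\n', variance_in_plot);
```

If the first two components explain only a small share of the variance, the scatter plot is a weak summary of the data, and any apparent overlap or separation should be read with caution.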