Scikit-learn PCA computes incorrect last row of y-values

Question

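I am performing a PCA with scikit-learn in Python 3.

However, after I run my code, the principal components in the last row have a value that is "off". I know that the last row is correct.

I plotted three PCAs to visualize the problem. In the first plot (full dataset) the "Sample" point is plotted where expected. In the second and third plots, where I removed populations (so only a subset of the full dataset is used), the sample is plotted in a "weird" position.

The data frame with the calculated principal components (see the last row):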

      principal_component_1  principal_component_2 Sample_name         Population
0                  3.279363              -0.288892     HG02291  American_Ancestry
1                  3.625035              -0.296081     HG02275  American_Ancestry
2                  3.870248              -0.264558     HG02272  American_Ancestry
3                  3.118460              -0.272594     HG02271  American_Ancestry
4                  2.811992              -0.376418     HG02259  American_Ancestry
...                     ...                    ...         ...                ...
1590               1.849372              -0.167314   HGDP00555  Oceanian_Ancestry
1591               1.666233              -0.224749   HGDP00556  Oceanian_Ancestry
1592               1.983947              -0.202254   HGDP00552  Oceanian_Ancestry
1593               2.202948              -0.210858   HGDP00554  Oceanian_Ancestry
1594              -4.693172             126.672265      Sample             Sample

The code I use:

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def do_pca(pca_data, sample_name, pops):
    """
    This function plots the PCA data from the sample and dataset in a PCA plot
    """

    # initialize variables for the PCA plot
    pops = pops + ["Sample"]
    pca_df = pd.read_csv(pca_data, sep=";")
    # drop=True keeps the old index from being added as an extra "index" feature column
    pca_df = pca_df[pca_df["Population"].isin(pops)].reset_index(drop=True)
    features = list(pca_df.columns.values)
    features.remove("Population")
    features.remove("Sample_name")
    x = pca_df.loc[:, features].values  # Separating out the features
    y = pca_df.loc[:, ["Population", "Sample_name"]]  # Separating out the target
    x = StandardScaler().fit_transform(x)  # Standardizing the features

    # initialize PCA plot
    dot_size = 20
    pca = PCA(n_components=2)
    pc = pca.fit_transform(x)
    pc_df = pd.DataFrame(data=pc, columns=["principal_component_%s" % (i + 1) for i in range(2)])

    pc_df["Sample_name"] = y["Sample_name"]
    pc_df["Population"] = y["Population"]
    return pc_df
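
One detail that may be worth checking, offered as a guess rather than a confirmed diagnosis: `DataFrame.reset_index()` called without `drop=True` keeps the old index as a new column named `index`. When the feature list is built from every column except `Population` and `Sample_name`, such a column would be scaled and passed to the PCA as if it were a real feature. The minimal sketch below uses a made-up three-row frame (the `f1` column and all values are hypothetical) to show the difference.

```python
import pandas as pd

# Hypothetical toy data; only the column names Population and Sample_name
# come from the question.
df = pd.DataFrame({
    "f1": [0.1, 0.2, 0.3],
    "Sample_name": ["A", "B", "Sample"],
    "Population": ["Pop1", "Pop2", "Sample"],
})

# Keep a subset of populations, as the function in the question does.
subset = df[df["Population"].isin(["Pop2", "Sample"])]


def feature_columns(frame):
    """Return every column except the two label columns."""
    return [c for c in frame.columns if c not in ("Population", "Sample_name")]


kept = feature_columns(subset.reset_index())              # old index kept as a column
dropped = feature_columns(subset.reset_index(drop=True))  # old index discarded

print(kept)     # ['index', 'f1']
print(dropped)  # ['f1']
```

Whether this explains the outlying last row depends on the actual data, so it is only one thing to rule out.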

Can someone explain to me what I am doing wrong? Is my code off?

I found a similar question on Stack Overflow, but it has no answers: link

Answer 1

Try turning it off and on again :/
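
A more concrete thing to try, as a sketch under assumptions rather than a confirmed fix: the function in the question fits both `StandardScaler` and `PCA` on the reference populations and the "Sample" row together, so the axes change whenever the population subset changes, and a single unusual row can dominate a component. A common alternative is to fit on the reference rows only and then project the sample with `transform`. The data below is random and purely illustrative, and `project_sample` is a hypothetical helper name, not something from the original post.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def project_sample(reference, sample, n_components=2):
    """Fit scaling and PCA on the reference rows only, then project the sample.

    reference: array of shape (n_reference, n_features)
    sample:    array of shape (n_samples, n_features)
    """
    scaler = StandardScaler().fit(reference)
    pca = PCA(n_components=n_components).fit(scaler.transform(reference))
    ref_pc = pca.transform(scaler.transform(reference))
    sample_pc = pca.transform(scaler.transform(sample))
    return ref_pc, sample_pc


# Random illustrative data: 50 reference rows, 1 sample row, 5 features.
rng = np.random.default_rng(0)
reference = rng.normal(size=(50, 5))
sample = rng.normal(size=(1, 5))

ref_pc, sample_pc = project_sample(reference, sample)
print(ref_pc.shape, sample_pc.shape)
```

With this arrangement the reference coordinates do not depend on the sample at all, so an odd sample row cannot reshape the components it is plotted on.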

python

pca

scikit-learn