Can I standardize my PCA-applied count vector?

Question

I applied CountVectorizer() to X_train, and it returned a sparse matrix.

Usually, if we want to standardize a sparse matrix, we pass the with_mean=False parameter.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler(with_mean=False)
X_train = scaler.fit_transform(X_train)

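In my case, however, after applying CountVectorizer to X_train I also performed PCA (TruncatedSVD) to reduce the dimensionality. My data is no longer a sparse matrix.

So can I now apply StandardScaler() directly, without passing with_mean=False (i.e. with_mean=True)?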

Answer 1

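If you take a look at what the with_mean parameter does, you'll see that it simply centers the data before scaling. The reason a sparse matrix is not centered is that centering turns it into a dense matrix, which takes up far more memory and defeats the purpose of storing it sparsely in the first place.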
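
To make that concrete, here is a small sketch. The matrix values are made up for illustration and are not from your data. It shows that StandardScaler refuses to center sparse input, that with_mean=False keeps the result sparse, and that centering a dense copy fills in the zeros.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# Toy sparse count matrix (made-up values for illustration).
X_sparse = sparse.csr_matrix(np.array([[1, 0, 0, 2],
                                       [0, 0, 3, 0],
                                       [0, 1, 0, 0]], dtype=float))

# Centering would destroy sparsity, so scikit-learn rejects it for sparse input.
try:
    StandardScaler(with_mean=True).fit(X_sparse)
    centering_rejected = False
except ValueError:
    centering_rejected = True

# Scaling without centering keeps the matrix sparse.
X_scaled = StandardScaler(with_mean=False).fit_transform(X_sparse)
still_sparse = sparse.issparse(X_scaled)

# Centering a dense copy turns the zero entries into nonzero values.
X_dense = X_sparse.toarray()
X_centered = X_dense - X_dense.mean(axis=0)
nonzeros_before = int(np.count_nonzero(X_dense))
nonzeros_after = int(np.count_nonzero(X_centered))
```

On a real vocabulary with tens of thousands of columns, that fill-in is what makes centering impractical.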

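After performing PCA, your data has a reduced number of dimensions and is dense, so it can now be centered before scaling. So yes, you can apply StandardScaler() directly.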
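
A minimal sketch of the whole flow, assuming a tiny made-up corpus in place of your X_train and an arbitrary n_components=2 (both are placeholders, not values from the question):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler

# Hypothetical toy corpus standing in for X_train.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
    "the mat was on the floor",
    "pets sleep on the floor",
]

counts = CountVectorizer().fit_transform(docs)       # sparse matrix
svd = TruncatedSVD(n_components=2, random_state=0)   # n_components chosen arbitrarily
X_reduced = svd.fit_transform(counts)                # dense NumPy array

# The input is dense now, so the default with_mean=True works.
X_std = StandardScaler().fit_transform(X_reduced)

col_means = X_std.mean(axis=0)   # approximately 0 for every column
col_stds = X_std.std(axis=0)     # approximately 1 for every column
```

For test data you would reuse the fitted vectorizer, SVD and scaler with transform() rather than fitting them again.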
