t-SNE predictions in R

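Goal: I want to use t-SNE (t-distributed stochastic neighbor embedding) in R to reduce the dimensionality of my training data (N observations and K variables, where K >> N), and then obtain a t-SNE representation of my test data.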

Example: Suppose I want to reduce the K variables to D = 2 dimensions (typically D = 2 or D = 3 for t-SNE). There are two R packages, Rtsne and tsne; I use the former here.

# load packages
library(Rtsne)

# Generate Training Data: random standard normal matrix with K=400 variables and N=100 observations
x.train <- matrix(rnorm(n=40000, mean=0, sd=1), nrow=100, ncol=400)

# Generate Test Data: random standard normal vector with N=1 observation for K=400 variables
x.test <- rnorm(n=400, mean=0, sd=1)

# perform t-SNE
set.seed(1)
fit.tsne <- Rtsne(X=x.train, dims=2)

Here fit.tsne$Y returns a (100x2) matrix containing the t-SNE representation of the data; it can be plotted with plot(fit.tsne$Y).

Question: What I am looking for is a function that returns a prediction pred of dimension (1x2) for my test data, based on the trained t-SNE model. Something like:

# The function I am looking for (but doesn't exist yet):
pred <- predict(object=fit.tsne, newdata=x.test)

Is this possible, and if so, how? Can you help me with this?

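t-SNE fundamentally does not do what you want. It is designed only for visualizing a dataset in a low-dimensional (2 or 3) space. You give it all the data you want to visualize at once. It is not a general-purpose dimensionality reduction tool.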

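If you are trying to apply t-SNE to "new" data, you are probably not thinking about your problem correctly, or you may simply have misunderstood the purpose of t-SNE.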

From the author himself (https://lvdmaaten.github.io/tsne/):

Once I have a t-SNE map, how can I embed incoming test points in that map?

t-SNE learns a non-parametric mapping, which means that it does not learn an explicit function that maps data from the input space to the map. Therefore, it is not possible to embed test points in an existing map (although you could re-run t-SNE on the full dataset). A potential approach to deal with this would be to train a multivariate regressor to predict the map location from the input data. Alternatively, you could also make such a regressor minimize the t-SNE loss directly, which is what I did in this paper (https://lvdmaaten.github.io/publications/papers/AISTATS_2009.pdf).

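So you cannot directly apply t-SNE to new data points. However, you can fit a multivariate regression model between your data and the embedding dimensions. The author recognizes this as a limitation of the method and suggests this approach as a workaround.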
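The regression idea from the quote could look roughly like the sketch below. It is only an illustration under assumptions that are not in the original posts: because K >> N, a direct lm() on all 400 variables would be rank-deficient, so the inputs are first compressed with PCA (prcomp), and the number of components (n.pc = 10) is an arbitrary choice. Any other multivariate regressor could be swapped in. With the random data from the question the predicted location carries no real meaning; the point is only the mechanics.

```r
library(Rtsne)

set.seed(1)
x.train <- matrix(rnorm(n = 40000, mean = 0, sd = 1), nrow = 100, ncol = 400)
x.test  <- rnorm(n = 400, mean = 0, sd = 1)

# Fit t-SNE on the training data only
fit.tsne <- Rtsne(X = x.train, dims = 2)
Y <- fit.tsne$Y                      # (100 x 2) map coordinates

# Assumption: compress the 400 inputs to a few principal components,
# since lm() cannot estimate 400 coefficients from 100 observations
n.pc <- 10
pca <- prcomp(x.train, center = TRUE, scale. = FALSE, rank. = n.pc)
train.df <- as.data.frame(pca$x)     # columns PC1 ... PC10

# Multivariate linear regression: both map coordinates as the response
reg <- lm(Y ~ ., data = train.df)

# Project the test observation onto the same components, then predict
test.df <- as.data.frame(predict(pca, newdata = matrix(x.test, nrow = 1)))
pred <- predict(reg, newdata = test.df)   # (1 x 2) approximate map location
pred
```

Note that this only approximates where the point would sit in the existing map; it does not make t-SNE itself parametric.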

t-SNE does not really work this way:

The following is an excerpt from the t-SNE author's website (https://lvdmaaten.github.io/tsne/):

Once I have a t-SNE map, how can I embed incoming test points in that map?

t-SNE learns a non-parametric mapping, which means that it does not learn an explicit function that maps data from the input space to the map. Therefore, it is not possible to embed test points in an existing map (although you could re-run t-SNE on the full dataset). A potential approach to deal with this would be to train a multivariate regressor to predict the map location from the input data. Alternatively, you could also make such a regressor minimize the t-SNE loss directly, which is what I did in this paper.

You might be interested in his paper: https://lvdmaaten.github.io/publications/papers/AISTATS_2009.pdf

This site, besides being really cool, offers a wealth of information about t-SNE: http://distill.pub/2016/misread-tsne/

On Kaggle I have also seen people do things like this, which may be of interest too: https://www.kaggle.com/cherzy/d/dalpozz/creditcardfraud/visualization-on-a-2d-map-with-t-sne

This is the email reply from the author of the Rtsne package (Jesse Krijthe):

Thank you for the very specific question. I had an earlier request for this and it is noted as an open issue on GitHub (https://github.com/jkrijthe/Rtsne/issues/6). The main reason I am hesitant to implement something like this is that, in a sense, there is no 'natural' way to explain what a prediction means in terms of tsne. To me, tsne is a way to visualize a distance matrix. As such, a new sample would lead to a new distance matrix and hence a new visualization. So, my current thinking is that the only sensible way would be to rerun the tsne procedure on the train and test set combined.

Having said that, other people do think it makes sense to define predictions, for instance by keeping the train objects fixed in the map and finding good locations for the test objects (as was suggested in the issue). An approach I would personally prefer over this would be something like parametric tsne, which Laurens van der Maaten (the author of the tsne paper) explored in a paper. However, this would best be implemented using something other than my package, because the parametric model is likely most effective if it is selected by the user.

So my suggestion would be to 1) refit the mapping using all data or 2) see if you can find an implementation of parametric tsne, the only one I know of would be Laurens's Matlab implementation.

Sorry I can not be of more help. If you come up with any other/better solutions, please let me know.
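Option 1 from the reply (refit the mapping on all the data) can be sketched as follows, reusing the simulated data from the question. The names x.all and fit.all are introduced here only for illustration. Keep in mind that this produces a new map: the coordinates of the training points will generally differ from those of a fit on the training data alone.

```r
library(Rtsne)

set.seed(1)
x.train <- matrix(rnorm(n = 40000, mean = 0, sd = 1), nrow = 100, ncol = 400)
x.test  <- rnorm(n = 400, mean = 0, sd = 1)

# Stack training and test observations; the test point becomes row 101
x.all <- rbind(x.train, x.test)

# Re-run t-SNE on the combined data
fit.all <- Rtsne(X = x.all, dims = 2)

# Map location of the test observation in the new embedding (1 x 2)
pred <- fit.all$Y[nrow(x.all), , drop = FALSE]
pred

# Plot the combined map and highlight the test point
plot(fit.all$Y, col = c(rep("grey40", nrow(x.train)), "red"), pch = 19)
```

Since t-SNE is stochastic, results depend on the seed, so a fixed set.seed() call is needed if the map has to be reproducible.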