Calculate the transformation of a PCA in R?
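I am looking for the weights that represent the mapping from a data set to its PCs. The aim is to set up a "calibrated" fixed space of, e.g., three types of wine, so that when a new observation is introduced, such as a new wine, it can be placed within the previously calibrated space without changing the fixed PC values. The new observation can then be allocated appropriately by applying the same transformation that was applied to the first three types.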
library(ggbiplot)
data(wine)
wine.pca <- prcomp(wine, center = TRUE, scale. = TRUE)
print(ggbiplot(wine.pca, obs.scale = 1, var.scale = 1, groups = wine.class, ellipse = TRUE, circle = TRUE))
Edit: The wine data set is split to obtain training data, which gives what I call the calibrated space.
samp <- sample(nrow(wine), nrow(wine)*0.75)
wine.train <- wine[samp,]
The rows not drawn for training then form the validation data set, e.g.
wine.valid <- wine[-samp,]
#PCA on training data
wine.train.pca <- prcomp(wine.train, center = TRUE, scale. = TRUE)
#use the transformation matrix from the training data to predict the validation data
pred <- predict(wine.train.pca, newdata = wine.valid)
How to then represent the calibrated space resulting from training together with the transformed validation/testing data is addressed in this thread.
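This is easy to do with the predict function for prcomp. Below I demonstrate the performance by splitting your wine data into two parts: a training and a validation data set. The PCA coordinates predicted for the validation set, from a PCA fitted with prcomp on the training set, are then compared with the same coordinates derived from the full data set: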
library(ggbiplot)
data(wine)
# pca on whole dataset
wine.pca <- prcomp(wine, center = TRUE, scale. = TRUE)
# pca on training part of dataset, then project new data onto pca coordinates
set.seed(1)
samp <- sample(nrow(wine), nrow(wine)*0.75)
wine.train <- wine[samp,]
wine.valid <- wine[-samp,]
wine.train.pca <- prcomp(wine.train, center = TRUE, scale. = TRUE)
pred <- predict(wine.train.pca, newdata = wine.valid)
# plot original vs predicted pca coordinates
matplot(wine.pca$x[-samp, 1:4], pred[, 1:4])
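Since the question asks for the weights themselves, here is a minimal sketch of what predict does for a prcomp fit: it standardizes the new rows with the training means and standard deviations, then multiplies by the loading matrix. To keep the sketch runnable without ggbiplot, it assumes the numeric columns of the built-in iris data as a stand-in for the wine data; the names dat, fit and manual are only illustrative.

```r
# Stand-in data: numeric columns of the built-in iris data set
# (an assumption so the sketch runs without ggbiplot's wine data).
dat <- iris[, 1:4]

set.seed(1)
samp  <- sample(nrow(dat), floor(nrow(dat) * 0.75))
train <- dat[samp, ]
valid <- dat[-samp, ]

fit <- prcomp(train, center = TRUE, scale. = TRUE)

# The fixed "weights" of the mapping, all determined by the training fit:
fit$center    # per-variable training means
fit$scale     # per-variable training standard deviations
fit$rotation  # loadings: variables (rows) by PCs (columns)

# Project new observations by hand: standardize with the training
# constants, then multiply by the loadings.
manual <- scale(valid, center = fit$center, scale = fit$scale) %*% fit$rotation

# Compare with predict(); the two should agree up to floating-point error.
all.equal(unname(manual), unname(predict(fit, newdata = valid)))
```

With the wine data the same three components (wine.train.pca$center, wine.train.pca$scale, wine.train.pca$rotation) would define the calibrated space, and they do not change when new observations are projected.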
You can also look at the correlation between the predicted and original coordinates and see that it is very high for the leading PCs:
# correlation of predicted coordinates
abs(diag(cor(wine.pca$x[-samp,], pred[,])))
# PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
# 0.9991291 0.9955028 0.9882540 0.9418268 0.9681989 0.9770390 0.9603593 0.8991734 0.8090762 0.9326917
# PC11 PC12 PC13
# 0.9270951 0.9596963 0.9397388
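The abs() in the correlation above is presumably there because the sign of each principal component is arbitrary: two fits on slightly different data can return the same axis pointing in opposite directions. A small sketch of that sign ambiguity, again assuming the built-in iris data as a stand-in and using illustrative names:

```r
# Stand-in data (assumption): numeric columns of the built-in iris data set.
dat <- iris[, 1:4]
fit <- prcomp(dat, center = TRUE, scale. = TRUE)

# Flip the sign of PC1 in both the loadings and the scores.
rot_flipped <- fit$rotation
x_flipped   <- fit$x
rot_flipped[, 1] <- -rot_flipped[, 1]
x_flipped[, 1]   <- -x_flipped[, 1]

# Both versions reconstruct the same standardized data ...
recon_orig    <- fit$x %*% t(fit$rotation)
recon_flipped <- x_flipped %*% t(rot_flipped)
all.equal(recon_orig, recon_flipped)

# ... yet the two sets of PC1 scores are anti-correlated, so only the
# absolute value of the correlation is meaningful when comparing fits.
cor(fit$x[, 1], x_flipped[, 1])
```

When new observations are projected with predict on a single fitted object, the signs stay fixed, so this only matters when comparing two separate PCA fits.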
Edit:
Here is an example of classification using randomForest:
library(ggbiplot)
data(wine)
wine$class <- wine.class
# install.packages("randomForest")
library(randomForest)
set.seed(1)
train <- sample(nrow(wine), nrow(wine)*0.5)
valid <- seq(nrow(wine))[-train]
winetrain <- wine[train,]
winevalid <- wine[valid,]
modfit <- randomForest(class ~ ., data = winetrain, ntree = 500)
pred <- predict(modfit, newdata=winevalid, type='class')
The importance of each variable can be returned with:
importance(modfit) # importance of variables in prediction
# MeanDecreaseGini
# Alcohol 8.5032770
# MalicAcid 1.3122286
# Ash 0.6827924
# AlcAsh 1.9517369
# Mg 1.3632713
# Phenols 2.7943536
# Flav 6.5798205
# NonFlavPhenols 1.1712744
# Proa 1.2412928
# Color 8.7097870
# Hue 5.2674082
# OD 6.6101764
# Proline 10.7032775
And the prediction accuracy is returned as follows:
TAB <- table(pred, winevalid$class) # table of predictions vs. original classifications
TAB
# pred barolo grignolino barbera
# barolo 29 1 0
# grignolino 1 30 0
# barbera 0 1 27
sum(diag(TAB)) / sum(TAB) # overall accuracy
# [1] 0.9662921
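If per-class figures are wanted in addition to the overall accuracy, they can be read off the same table. A minimal sketch, rebuilding the confusion table from the counts printed above (rows are predictions, columns are the original classes) so it runs without the fitted model:

```r
# Confusion table copied from the printed output above
# (rows: predicted class, columns: original class).
classes <- c("barolo", "grignolino", "barbera")
TAB <- matrix(c(29,  1,  0,
                 1, 30,  0,
                 0,  1, 27),
              nrow = 3, byrow = TRUE,
              dimnames = list(pred = classes, actual = classes))

overall   <- sum(diag(TAB)) / sum(TAB)  # overall accuracy, as above
recall    <- diag(TAB) / colSums(TAB)   # share of each original class recovered
precision <- diag(TAB) / rowSums(TAB)   # share of each predicted class that is correct

overall
recall
precision
```

In a live session the same three expressions can be applied directly to the TAB object produced by table(pred, winevalid$class).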