如何使用 FactoMineR 包以编程方式确定主成分的列索引?
How to programmatically determine the column indices of principal components using FactoMineR package?
给定一个包含混合变量(即分类变量和连续变量)的数据框,例如,
digits = 0:9
# set seed for reproducibility
set.seed(17)
# function to create random string
createRandString <- function(n = 5000) {
a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}
df <- data.frame(ID=c(1:10), name=sample(letters[1:10]),
studLoc=sample(createRandString(10)),
finalmark=sample(c(0:100),10),
subj1mark=sample(c(0:100),10),subj2mark=sample(c(0:100),10)
)
我使用包 FactoMineR
执行无监督特征选择
df.princomp <- FactoMineR::FAMD(df, graph = FALSE)
变量df.princomp
是一个列表。
此后,可视化我使用的主要组件
fviz_screeplot()
和 fviz_contrib()
喜欢,
#library(factoextra)
factoextra::fviz_screeplot(df.princomp, addlabels = TRUE,
barfill = "gray", barcolor = "black",
ylim = c(0, 50), xlab = "Principal Component",
ylab = "Percentage of explained variance",
main = "Principal Component (PC) for mixed variables")
factoextra::fviz_contrib(df.princomp, choice = "var",
axes = 1, top = 10, sort.val = c("desc"))
这给出了下面的图 1
和图2
Fig1的解释: Fig1是一个碎石图。 Scree Plot 是一个简单的线段图,显示数据中总方差的分数,如每个主成分 (PC) 所解释或表示的那样。所以我们可以看到前三台 PC 共同负责 43.8%
的总方差。现在问题自然就来了,"What are these variables?"。我已经在图 2 中展示了这一点。
图 2 的解释:该图可视化了主成分分析 (PCA) 结果中 rows/columns 的贡献。从这里我可以看到变量,name
、studLoc
和 finalMark
是最重要的变量,可用于进一步分析。
进一步分析-我卡在的地方:推导上述变量name
、studLoc
、finalMark
的贡献.我使用主成分变量 df.princomp
(见上文),如 df.princomp$quanti.var$contrib[,4]
和 df.princomp$quali.var$contrib[,2:3]
。
我必须手动指定列索引 [,2:3]
和 [,4]
。
我想要的:我想知道如何进行动态列索引分配,这样我就不必在列出 df.princomp
?
我已经查看了以下类似问题 1, , 3 and 4 但找不到我的解决方案?解决此问题的任何帮助或建议都会有所帮助。
有很多方法可以提取单个变量对 PC 的贡献。对于数字输入,可以 运行 一个带有 prcomp
的 PCA 并查看 $rotation
(我很快就谈过了,忘了你在这里有因素,所以 prcomp
不起作用直接地)。由于您使用的是 factoextra::fviz_contrib
,因此有必要检查该函数如何在幕后提取此信息。键入 factoextra::fviz_contrib
并读取函数:
> factoextra::fviz_contrib
function (X, choice = c("row", "col", "var", "ind", "quanti.var",
"quali.var", "group", "partial.axes"), axes = 1, fill = "steelblue",
color = "steelblue", sort.val = c("desc", "asc", "none"),
top = Inf, xtickslab.rt = 45, ggtheme = theme_minimal(),
...)
{
sort.val <- match.arg(sort.val)
choice = match.arg(choice)
title <- .build_title(choice[1], "Contribution", axes)
dd <- facto_summarize(X, element = choice, result = "contrib",
axes = axes)
contrib <- dd$contrib
names(contrib) <- rownames(dd)
theo_contrib <- 100/length(contrib)
if (length(axes) > 1) {
eig <- get_eigenvalue(X)[axes, 1]
theo_contrib <- sum(theo_contrib * eig)/sum(eig)
}
df <- data.frame(name = factor(names(contrib), levels = names(contrib)),
contrib = contrib)
if (choice == "quanti.var") {
df$Groups <- .get_quanti_var_groups(X)
if (missing(fill))
fill <- "Groups"
if (missing(color))
color <- "Groups"
}
p <- ggpubr::ggbarplot(df, x = "name", y = "contrib", fill = fill,
color = color, sort.val = sort.val, top = top, main = title,
xlab = FALSE, ylab = "Contributions (%)", xtickslab.rt = xtickslab.rt,
ggtheme = ggtheme, sort.by.groups = FALSE, ...) + geom_hline(yintercept = theo_contrib,
linetype = 2, color = "red")
p
}
<environment: namespace:factoextra>
所以它实际上只是从同一个包中调用 facto_summarize
。以此类推,您可以做同样的事情,只需调用:
> dd <- factoextra::facto_summarize(df.princomp, element = "var", result = "contrib", axes = 1)
> dd
name contrib
ID ID 0.9924561
finalmark finalmark 21.4149175
subj1mark subj1mark 7.1874438
subj2mark subj2mark 16.6831560
name name 26.8610132
studLoc studLoc 26.8610132
这就是你图2对应的table,PC2用axes = 2
等等。
关于 "how to programmatically determine the column indices of the PCs",我不是 100% 确定我明白你想要什么,但如果你只想说 "finalmark" 列,抓住它对 PC3 的贡献,你可以执行以下操作:
library(tidyverse)
# make a tidy table of all column names in the original df with their contributions to all PCs
contribution_df <- map_df(set_names(1:5), ~factoextra::facto_summarize(df.princomp, element = "var", result = "contrib", axes = .x), .id = "PC")
# get the contribution of column 'finalmark' by name
contribution_df %>%
filter(name == "finalmark")
# get the contribution of column 'finalmark' to PC3
contribution_df %>%
filter(name == "finalmark" & PC == 3)
# or, just the numeric value of contribution
filter(contribution_df, name == "finalmark" & PC == 3)$contrib
顺便说一句,我认为 ID
在您的示例中被视为数字而不是因子,但由于它只是一个示例,所以我不会为此烦恼。
不确定我对您问题的解释是否正确,如果不正确,请见谅。据我所知,您正在使用 PCA 作为初始工具来向您展示哪些变量在解释数据集时最重要。然后您想要返回原始数据,select 这些变量无需每次手动编码即可快速返回,并将它们用于其他分析。
如果这是正确的,那么我已经保存了贡献图中的数据,筛选出贡献最大的变量,并使用该结果创建了一个仅包含这些变量的新数据框。
digits = 0:9
# set seed for reproducibility
set.seed(17)
# function to create random string
createRandString <- function(n = 5000) {
a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}
df <- data.frame(ID=c(1:10), name=sample(letters[1:10]),
studLoc=sample(createRandString(10)),
finalmark=sample(c(0:100),10),
subj1mark=sample(c(0:100),10),subj2mark=sample(c(0:100),10)
)
df.princomp <- FactoMineR::FAMD(df, graph = FALSE)
factoextra::fviz_screeplot(df.princomp, addlabels = TRUE,
barfill = "gray", barcolor = "black",
ylim = c(0, 50), xlab = "Principal Component",
ylab = "Percentage of explained variance",
main = "Principal Component (PC) for mixed variables")
#find the top contributing variables to the overall variation in the dataset
#here I am choosing the top 10 variables (although we only have 6 in our df).
#note you can specify which axes you want to look at with axes=, you can even do axes=c(1,2)
f<-factoextra::fviz_contrib(df.princomp, choice = "var",
axes = c(1), top = 10, sort.val = c("desc"))
#save data from contribution plot
dat<-f$data
#filter out ID's that are higher than, say, 20
r<-rownames(dat[dat$contrib>20,])
#extract these from your original data frame into a new data frame for further analysis
new<-df[r]
new
#finalmark name studLoc
#1 53 b POTYQ0002N
#2 73 i LWMTW1195I
#3 95 d VTUGO1685F
#4 39 f YCGGS5755N
#5 97 c GOSWE3283C
#6 58 g APBQD6181U
#7 67 a VUJOG1460V
#8 64 h YXOGP1897F
#9 15 j NFUOB6042V
#10 81 e QYTHG0783G
根据你的评论,你说你想'Find variables with value greater than 5 in Dim.1 AND Dim.2 and save these variables to a new data frame',我会这样做:
#top contributors to both Dim 1 and 2
f<-factoextra::fviz_contrib(df.princomp, choice = "var",
axes = c(1,2), top = 10, sort.val = c("desc"))
#save data from contribution plot
dat<-f$data
#filter out ID's that are higher than 5
r<-rownames(dat[dat$contrib>5,])
#extract these from your original data frame into a new data frame for further analysis
new<-df[r]
new
(这将所有原始变量保留在我们的新数据框中,因为它们对总方差的贡献都超过 5%)
给定一个包含混合变量(即分类变量和连续变量)的数据框,例如,
digits = 0:9
# set seed for reproducibility
set.seed(17)
# function to create random string
createRandString <- function(n = 5000) {
a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}
df <- data.frame(ID=c(1:10), name=sample(letters[1:10]),
studLoc=sample(createRandString(10)),
finalmark=sample(c(0:100),10),
subj1mark=sample(c(0:100),10),subj2mark=sample(c(0:100),10)
)
我使用包 FactoMineR
df.princomp <- FactoMineR::FAMD(df, graph = FALSE)
变量df.princomp
是一个列表。
此后,可视化我使用的主要组件
fviz_screeplot()
和 fviz_contrib()
喜欢,
#library(factoextra)
factoextra::fviz_screeplot(df.princomp, addlabels = TRUE,
barfill = "gray", barcolor = "black",
ylim = c(0, 50), xlab = "Principal Component",
ylab = "Percentage of explained variance",
main = "Principal Component (PC) for mixed variables")
factoextra::fviz_contrib(df.princomp, choice = "var",
axes = 1, top = 10, sort.val = c("desc"))
这给出了下面的图 1
和图2
Fig1的解释: Fig1是一个碎石图。 Scree Plot 是一个简单的线段图,显示数据中总方差的分数,如每个主成分 (PC) 所解释或表示的那样。所以我们可以看到前三台 PC 共同负责 43.8%
的总方差。现在问题自然就来了,"What are these variables?"。我已经在图 2 中展示了这一点。
图 2 的解释:该图可视化了主成分分析 (PCA) 结果中 rows/columns 的贡献。从这里我可以看到变量,name
、studLoc
和 finalMark
是最重要的变量,可用于进一步分析。
进一步分析-我卡在的地方:推导上述变量name
、studLoc
、finalMark
的贡献.我使用主成分变量 df.princomp
(见上文),如 df.princomp$quanti.var$contrib[,4]
和 df.princomp$quali.var$contrib[,2:3]
。
我必须手动指定列索引 [,2:3]
和 [,4]
。
我想要的:我想知道如何进行动态列索引分配,这样我就不必在列出 df.princomp
?
我已经查看了以下类似问题 1,
有很多方法可以提取单个变量对 PC 的贡献。对于数字输入,可以 运行 一个带有 prcomp
的 PCA 并查看 $rotation
(我很快就谈过了,忘了你在这里有因素,所以 prcomp
不起作用直接地)。由于您使用的是 factoextra::fviz_contrib
,因此有必要检查该函数如何在幕后提取此信息。键入 factoextra::fviz_contrib
并读取函数:
> factoextra::fviz_contrib
function (X, choice = c("row", "col", "var", "ind", "quanti.var",
"quali.var", "group", "partial.axes"), axes = 1, fill = "steelblue",
color = "steelblue", sort.val = c("desc", "asc", "none"),
top = Inf, xtickslab.rt = 45, ggtheme = theme_minimal(),
...)
{
sort.val <- match.arg(sort.val)
choice = match.arg(choice)
title <- .build_title(choice[1], "Contribution", axes)
dd <- facto_summarize(X, element = choice, result = "contrib",
axes = axes)
contrib <- dd$contrib
names(contrib) <- rownames(dd)
theo_contrib <- 100/length(contrib)
if (length(axes) > 1) {
eig <- get_eigenvalue(X)[axes, 1]
theo_contrib <- sum(theo_contrib * eig)/sum(eig)
}
df <- data.frame(name = factor(names(contrib), levels = names(contrib)),
contrib = contrib)
if (choice == "quanti.var") {
df$Groups <- .get_quanti_var_groups(X)
if (missing(fill))
fill <- "Groups"
if (missing(color))
color <- "Groups"
}
p <- ggpubr::ggbarplot(df, x = "name", y = "contrib", fill = fill,
color = color, sort.val = sort.val, top = top, main = title,
xlab = FALSE, ylab = "Contributions (%)", xtickslab.rt = xtickslab.rt,
ggtheme = ggtheme, sort.by.groups = FALSE, ...) + geom_hline(yintercept = theo_contrib,
linetype = 2, color = "red")
p
}
<environment: namespace:factoextra>
所以它实际上只是从同一个包中调用 facto_summarize
。以此类推,您可以做同样的事情,只需调用:
> dd <- factoextra::facto_summarize(df.princomp, element = "var", result = "contrib", axes = 1)
> dd
name contrib
ID ID 0.9924561
finalmark finalmark 21.4149175
subj1mark subj1mark 7.1874438
subj2mark subj2mark 16.6831560
name name 26.8610132
studLoc studLoc 26.8610132
这就是你图2对应的table,PC2用axes = 2
等等。
关于 "how to programmatically determine the column indices of the PCs",我不是 100% 确定我明白你想要什么,但如果你只想说 "finalmark" 列,抓住它对 PC3 的贡献,你可以执行以下操作:
library(tidyverse)
# make a tidy table of all column names in the original df with their contributions to all PCs
contribution_df <- map_df(set_names(1:5), ~factoextra::facto_summarize(df.princomp, element = "var", result = "contrib", axes = .x), .id = "PC")
# get the contribution of column 'finalmark' by name
contribution_df %>%
filter(name == "finalmark")
# get the contribution of column 'finalmark' to PC3
contribution_df %>%
filter(name == "finalmark" & PC == 3)
# or, just the numeric value of contribution
filter(contribution_df, name == "finalmark" & PC == 3)$contrib
顺便说一句,我认为 ID
在您的示例中被视为数字而不是因子,但由于它只是一个示例,所以我不会为此烦恼。
不确定我对您问题的解释是否正确,如果不正确,请见谅。据我所知,您正在使用 PCA 作为初始工具来向您展示哪些变量在解释数据集时最重要。然后您想要返回原始数据,select 这些变量无需每次手动编码即可快速返回,并将它们用于其他分析。
如果这是正确的,那么我已经保存了贡献图中的数据,筛选出贡献最大的变量,并使用该结果创建了一个仅包含这些变量的新数据框。
digits = 0:9
# set seed for reproducibility
set.seed(17)
# function to create random string
createRandString <- function(n = 5000) {
a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}
df <- data.frame(ID=c(1:10), name=sample(letters[1:10]),
studLoc=sample(createRandString(10)),
finalmark=sample(c(0:100),10),
subj1mark=sample(c(0:100),10),subj2mark=sample(c(0:100),10)
)
df.princomp <- FactoMineR::FAMD(df, graph = FALSE)
factoextra::fviz_screeplot(df.princomp, addlabels = TRUE,
barfill = "gray", barcolor = "black",
ylim = c(0, 50), xlab = "Principal Component",
ylab = "Percentage of explained variance",
main = "Principal Component (PC) for mixed variables")
#find the top contributing variables to the overall variation in the dataset
#here I am choosing the top 10 variables (although we only have 6 in our df).
#note you can specify which axes you want to look at with axes=, you can even do axes=c(1,2)
f<-factoextra::fviz_contrib(df.princomp, choice = "var",
axes = c(1), top = 10, sort.val = c("desc"))
#save data from contribution plot
dat<-f$data
#filter out ID's that are higher than, say, 20
r<-rownames(dat[dat$contrib>20,])
#extract these from your original data frame into a new data frame for further analysis
new<-df[r]
new
#finalmark name studLoc
#1 53 b POTYQ0002N
#2 73 i LWMTW1195I
#3 95 d VTUGO1685F
#4 39 f YCGGS5755N
#5 97 c GOSWE3283C
#6 58 g APBQD6181U
#7 67 a VUJOG1460V
#8 64 h YXOGP1897F
#9 15 j NFUOB6042V
#10 81 e QYTHG0783G
根据你的评论,你说你想'Find variables with value greater than 5 in Dim.1 AND Dim.2 and save these variables to a new data frame',我会这样做:
#top contributors to both Dim 1 and 2
f<-factoextra::fviz_contrib(df.princomp, choice = "var",
axes = c(1,2), top = 10, sort.val = c("desc"))
#save data from contribution plot
dat<-f$data
#filter out ID's that are higher than 5
r<-rownames(dat[dat$contrib>5,])
#extract these from your original data frame into a new data frame for further analysis
new<-df[r]
new
(这将所有原始变量保留在我们的新数据框中,因为它们对总方差的贡献都超过 5%)