I want to create a subset of my dataframe by how subjects cluster in the biplot

Question

This is one of the biplots I am working with. The circles mark the clusters that I want to turn into subset data frames.

If I am interested in the top cluster, how can I select the data inside the rectangle -.1 < PC1 < .1 & .8 < PC2 < 1.6?

I can't share my data, but we can practice with the iris data set.

# iris ships with base R (datasets package), so no extra library is needed
biplot(prcomp(iris[,1:4]))

Say I am interested in the data inside the rectangle -.125 < PC1 < .75 & -.15 < PC2 < 1.0

How do I identify those observations and create a subset from them?

Answer 1

You can access the projected points (the PC scores) through the x element of the prcomp result, i.e. pc_res$x:

pc_res <- prcomp(iris[,1:4])
str(pc_res) # find that the data is stored in .$x
#> List of 5
#>  $ sdev    : num [1:4] 2.056 0.493 0.28 0.154
#>  $ rotation: num [1:4, 1:4] 0.3614 -0.0845 0.8567 0.3583 -0.6566 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
#>   .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
#>  $ center  : Named num [1:4] 5.84 3.06 3.76 1.2
#>   ..- attr(*, "names")= chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
#>  $ scale   : logi FALSE
#>  $ x       : num [1:150, 1:4] -2.68 -2.71 -2.89 -2.75 -2.73 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : NULL
#>   .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
#>  - attr(*, "class")= chr "prcomp"
dframe <- as.data.frame(pc_res$x)
sub_res <- subset(x = dframe, subset = -.125 < dframe$PC1 &
                          dframe$PC1 <.75 &
                          -.15 < dframe$PC2 &
                          dframe$PC2 < 1.0)
head(sub_res)
#>             PC1       PC2         PC3          PC4
#> 54  0.183317720 0.8279590  0.17959139  0.093566840
#> 56  0.641669084 0.4182469 -0.04107609 -0.243116767
#> 60 -0.008745404 0.7230819 -0.28114143 -0.005618918
#> 62  0.511698557 0.1039812 -0.13054775  0.050719232
#> 63  0.264976508 0.5500365  0.69414683  0.057185519
#> 67  0.660283762 0.3529697 -0.32802753 -0.187878621
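
Note that sub_res holds PC scores, not the original columns. Here is a minimal sketch of mapping the selection back to iris, assuming the rows of pc_res$x are in the same order as the input data; the names in_rect, iris_sub, lam and biplot_xy are my own. Also, as far as I can tell from the biplot.prcomp source, biplot() with its default scale = 1 draws the scores divided by sdev * sqrt(n), so limits read off the biplot's bottom/left axes refer to those rescaled coordinates rather than to the raw scores:

```r
pc_res <- prcomp(iris[, 1:4])

# Raw scores on the first two components, one row per observation
scores <- pc_res$x[, 1:2]

# Logical mask for the rectangle (same limits as in subset() above)
in_rect <- scores[, "PC1"] > -0.125 & scores[, "PC1"] < 0.75 &
           scores[, "PC2"] > -0.15  & scores[, "PC2"] < 1.0

# Reuse the mask on the original data frame to get the actual observations
iris_sub <- iris[in_rect, ]
head(iris_sub)

# Coordinates as drawn by biplot() with scale = 1 (assumption: default call):
# each score column is divided by sdev * sqrt(n)
lam <- pc_res$sdev[1:2] * sqrt(nrow(scores))
biplot_xy <- sweep(scores, 2, lam, "/")
range(biplot_xy[, "PC1"])
```

If your rectangle was read from the biplot axes, build the mask from biplot_xy instead of scores.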

EDIT: For the clustering itself, I would use an algorithm (kmeans here):

# if you want cluster from projection on (PC1, PC2)
dframe <- as.data.frame(prcomp(iris[,1:4])$x)
classif <- kmeans(x = dframe[,1:2], centers = 3, iter.max = 100, nstart = 10)
classif
#> K-means clustering with 3 clusters of sizes 61, 39, 50
#> 
#> Cluster means:
#>         PC1        PC2
#> 1  0.665676  0.3316042
#> 2  2.346527 -0.2739386
#> 3 -2.642415 -0.1908850
#> 
#> Clustering vector:
#>   [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
#>  [36] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#>  [71] 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 2 2
#> [106] 2 1 2 2 2 2 2 2 1 1 2 2 2 2 1 2 1 2 1 2 2 1 1 2 2 2 2 2 1 2 2 2 2 1 2
#> [141] 2 2 1 2 2 2 1 2 2 1
#> 
#> Within cluster sum of squares by cluster:
#> [1] 31.87959 18.87111 13.06924
#>  (between_SS / total_SS =  90.4 %)
#> 
#> Available components:
#> 
#> [1] "cluster"      "centers"      "totss"        "withinss"    
#> [5] "tot.withinss" "betweenss"    "size"         "iter"        
#> [9] "ifault"

# check visually your groups
str(classif)
#> List of 9
#>  $ cluster     : int [1:150] 3 3 3 3 3 3 3 3 3 3 ...
#>  $ centers     : num [1:3, 1:2] 0.666 2.347 -2.642 0.332 -0.274 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:3] "1" "2" "3"
#>   .. ..$ : chr [1:2] "PC1" "PC2"
#>  $ totss       : num 666
#>  $ withinss    : num [1:3] 31.9 18.9 13.1
#>  $ tot.withinss: num 63.8
#>  $ betweenss   : num 602
#>  $ size        : int [1:3] 61 39 50
#>  $ iter        : int 2
#>  $ ifault      : int 0
#>  - attr(*, "class")= chr "kmeans"
classif$centers
#>         PC1        PC2
#> 1  0.665676  0.3316042
#> 2  2.346527 -0.2739386
#> 3 -2.642415 -0.1908850
dframe$group <- classif$cluster
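# the colours show which label belongs to which cluster; pick the group whose
# centre matches the cluster of interest (group 1 in this run)
plot(x = dframe$PC1, y = dframe$PC2, col = dframe$group)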


result <- dframe[dframe$group == 1,] # or subset(x = dframe, subset = dframe$group == 1)
head(result)
#>           PC1         PC2         PC3           PC4 group
#> 52  0.9324885 -0.31833364  0.01801419  0.0005665121     1
#> 54  0.1833177  0.82795901  0.17959139  0.0935668402     1
#> 55  1.0881033 -0.07459068  0.30775790  0.1120205742     1
#> 56  0.6416691  0.41824687 -0.04107609 -0.2431167665     1
#> 57  1.0950607 -0.28346827 -0.16981024 -0.0835565724     1
#> 58 -0.7491227  1.00489096 -0.01230292 -0.0179077226     1
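
One caveat: kmeans starts from random centres, so the label numbers can change between runs and hard-coding group == 1 is fragile. Below is a sketch that picks the cluster by its centre instead and pulls the matching rows from the original data. It assumes the "top" cluster is the one whose centre has the largest PC2 value; the seed and the names top and iris_top are mine:

```r
set.seed(42)  # assumption: fixed seed, only to make the run repeatable

scores  <- as.data.frame(prcomp(iris[, 1:4])$x)
classif <- kmeans(x = scores[, 1:2], centers = 3, iter.max = 100, nstart = 10)

# Index of the cluster whose centre sits highest on PC2
top <- which.max(classif$centers[, "PC2"])

# Rows of the original data frame that fall in that cluster
iris_top <- iris[classif$cluster == top, ]
table(iris_top$Species)
```

Swap which.max for whatever rule identifies your cluster (for example the centre closest to a point you read off the plot).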

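Final note: there is a very good, graphical answer on SO about choosing the optimal number of clusters: cluster-analysis-in-r-determine-the-optimal-number-of-clusters. There are also packages that let you do this with ggplot2, such as FactoMineR, ...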
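
For a quick first look at the number of clusters, an elbow plot is one common option. This is only a rough sketch, not the method from the linked answer; the range of k, the seed and the variable names are my own choices:

```r
set.seed(42)  # assumption: fixed seed for repeatability

scores <- prcomp(iris[, 1:4])$x[, 1:2]

# Total within-cluster sum of squares for k = 1..8
ks  <- 1:8
wss <- sapply(ks, function(k) {
  kmeans(scores, centers = k, iter.max = 100, nstart = 10)$tot.withinss
})

# Look for the "elbow" where adding clusters stops paying off
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```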

r

pca

unsupervised-learning

biplot