Succinct way to set multiple factor levels of different variables to same color in ggplot scatterplots

Question

Consider the following simplified data frame:

x =  c(.35, .35, .37, .5, .55, .56, .9, .91, .89)
y = c(.35, .36, .35, .22, .27, .25, .88, .9, .87)
clu3 = as.factor(c(31,31,31,32,32,32,33,33,33))
clu4 = as.factor(c(41,41,41,42,43,43,44,44,44))

df = data.frame (x,y,clu3,clu4)

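In my analysis, three clusters were first fitted to the dataset (clu3, with factor levels 31, 32, and 33). Then four clusters were fitted to the same dataset (clu4, with factor levels 41, 42, 43, and 44), then five clusters, six clusters, and so on. For simplicity, I have included only the results of fitting three and four clusters.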

I can plot the results of each "run" (i.e., the three-cluster run and the four-cluster run) using the following:

ggplot(df, aes(x=x, y=y, color=clu3)) + 
  geom_point(size=4)+
  theme_bw()+
  ggtitle(paste("Three-cluster scatterplot"))+
  theme(plot.title = element_text(hjust = 0.5))

ggplot(df, aes(x=x, y=y, color=clu4)) + 
  geom_point(size=4)+
  theme_bw()+
  ggtitle(paste("Four-cluster scatterplot"))+
  theme(plot.title = element_text(hjust = 0.5))

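Right now, the cluster colors come from mapping color to clu3 or clu4 in ggplot. But in my example, clusters 31 and 41 are identical (although they come from different "runs"), and so are clusters 33 and 44. Later runs contain further identical clusters (when fitting five clusters to the data, six, and so on). What I would like is a succinct way to specify colors for the factor levels of different variables (here, clu3 and clu4). Conceptually, it would look like this: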

"31" | "41" = "purple"
"33" | "44" = "green"
"32"        = "blue"
"42"        = "orange"
"43"        = "yellow"

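I'm guessing the solution involves scale_fill_manual, and I have read about ways to keep factor-level colors consistent across plots (even when some factor levels are unused). But in all of those examples the factor levels belong to the same variable, whereas I want consistent colors for different factor levels from different variables. Any suggestions are much appreciated!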

Answer 1

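Note: this solution works for the case of two cluster columns, as in the example data. You would need different logic (something other than first/last) to create new_clu for more than two.

One solution is to reshape the data to "long" form, group by x and y, and then create a new variable new_clu based on the values of clu3 and clu4.

Note: I am using gather, but you could use the newer pivot_longer. The warning about attributes not being identical can be ignored.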

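library(dplyr)
library(tidyr)

# Stack clu3 and clu4 into one column (Val), then label points whose
# clu3/clu4 levels are known to be the same cluster
df %>% 
  gather(Cluster, Val, 3:4) %>% 
  group_by(x, y) %>% 
  mutate(new_clu = case_when(
    first(Val) == 31 & last(Val) == 41 ~ paste0(first(Val), "/", last(Val)),
    first(Val) == 33 & last(Val) == 44 ~ paste0(first(Val), "/", last(Val)),
    TRUE ~ Val
  )
)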

Result:

# A tibble: 18 x 5
# Groups:   x, y [9]
       x     y Cluster Val   new_clu
   <dbl> <dbl> <chr>   <chr> <chr>  
 1  0.35  0.35 clu3    31    31/41  
 2  0.35  0.36 clu3    31    31/41  
 3  0.37  0.35 clu3    31    31/41  
 4  0.5   0.22 clu3    32    32     
 5  0.55  0.27 clu3    32    32     
 6  0.56  0.25 clu3    32    32     
 7  0.9   0.88 clu3    33    33/44  
 8  0.91  0.9  clu3    33    33/44  
 9  0.89  0.87 clu3    33    33/44  
10  0.35  0.35 clu4    41    31/41  
11  0.35  0.36 clu4    41    31/41  
12  0.37  0.35 clu4    41    31/41  
13  0.5   0.22 clu4    42    42     
14  0.55  0.27 clu4    43    43     
15  0.56  0.25 clu4    43    43     
16  0.9   0.88 clu4    44    33/44  
17  0.91  0.9  clu4    44    33/44  
18  0.89  0.87 clu4    44    33/44
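For readers on a current tidyr, the same reshaping can be sketched with pivot_longer, as the note above suggests. This is only a sketch: the values_transform argument is used here (my choice, not part of the original answer) so that Val ends up as character, which is what gather produced. The rows come out interleaved per point rather than stacked by column, but first(Val) and last(Val) within each (x, y) group still refer to clu3 and clu4 respectively.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  x = c(.35, .35, .37, .5, .55, .56, .9, .91, .89),
  y = c(.35, .36, .35, .22, .27, .25, .88, .9, .87),
  clu3 = factor(c(31, 31, 31, 32, 32, 32, 33, 33, 33)),
  clu4 = factor(c(41, 41, 41, 42, 43, 43, 44, 44, 44))
)

long <- df %>%
  pivot_longer(
    cols = c(clu3, clu4),
    names_to = "Cluster",
    values_to = "Val",
    values_transform = list(Val = as.character)  # keep Val as character
  ) %>%
  group_by(x, y) %>%
  mutate(new_clu = case_when(
    first(Val) == "31" & last(Val) == "41" ~ paste0(first(Val), "/", last(Val)),
    first(Val) == "33" & last(Val) == "44" ~ paste0(first(Val), "/", last(Val)),
    TRUE ~ Val
  )) %>%
  ungroup()
```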

Now you can pass that to ggplot and color by new_clu:

library(ggplot2)

# Same pipeline as above, piped straight into ggplot
df %>% 
  gather(Cluster, Val, 3:4) %>% 
  group_by(x, y) %>% 
  mutate(new_clu = case_when(
    first(Val) == 31 & last(Val) == 41 ~ paste0(first(Val), "/", last(Val)),
    first(Val) == 33 & last(Val) == 44 ~ paste0(first(Val), "/", last(Val)),
    TRUE ~ Val
  )
) %>% 
  ggplot(aes(x, y)) + 
  geom_point(aes(color = new_clu))

Result: (plot not reproduced here; a scatterplot of x against y colored by new_clu)
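As the note at the top of this answer says, first/last does not extend beyond two cluster columns. One possible alternative, sketched here under the assumption that you already know which levels are identical, is a lookup vector applied row by row. The name shared_label and its contents are hypothetical, as is the choice to facet by Cluster so that each run gets its own panel.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

df <- data.frame(
  x = c(.35, .35, .37, .5, .55, .56, .9, .91, .89),
  y = c(.35, .36, .35, .22, .27, .25, .88, .9, .87),
  clu3 = factor(c(31, 31, 31, 32, 32, 32, 33, 33, 33)),
  clu4 = factor(c(41, 41, 41, 42, 43, 43, 44, 44, 44))
)

# Hypothetical lookup: every level that has an identical counterpart
# in another run maps to one shared label
shared_label <- c("31" = "31/41", "41" = "31/41",
                  "33" = "33/44", "44" = "33/44")

long <- df %>%
  pivot_longer(c(clu3, clu4), names_to = "Cluster", values_to = "Val",
               values_transform = list(Val = as.character)) %>%
  # Levels missing from the lookup keep their own name
  mutate(new_clu = coalesce(unname(shared_label[Val]), Val))

p <- ggplot(long, aes(x, y, color = new_clu)) +
  geom_point(size = 4) +
  facet_wrap(~ Cluster)
```

Adding a further run (say clu5) would then only require extending shared_label and the column selection, not the matching logic.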

Answer 2

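As you suggested, using scale_fill_manual or scale_color_manual is a valid option. You can write a function that matches colors between two clusterings (e.g., relative to the first or to the previous clustering).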
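Before automating anything, the manual version of this idea can be sketched as a single named color vector that covers the levels of every clustering and is reused in each plot. The name cluster_cols is made up for illustration; the colors follow the wish list in the question. scale_color_manual matches named values to factor levels, so each plot picks up only the entries it needs.

```r
library(ggplot2)

df <- data.frame(
  x = c(.35, .35, .37, .5, .55, .56, .9, .91, .89),
  y = c(.35, .36, .35, .22, .27, .25, .88, .9, .87),
  clu3 = factor(c(31, 31, 31, 32, 32, 32, 33, 33, 33)),
  clu4 = factor(c(41, 41, 41, 42, 43, 43, 44, 44, 44))
)

# One palette for all runs; levels considered identical share a color
cluster_cols <- c(
  setNames(rep("purple", 2), c("31", "41")),
  setNames(rep("green", 2), c("33", "44")),
  "32" = "blue", "42" = "orange", "43" = "yellow"
)

p3 <- ggplot(df, aes(x = x, y = y, color = clu3)) +
  geom_point(size = 4) +
  scale_color_manual(values = cluster_cols)

p4 <- ggplot(df, aes(x = x, y = y, color = clu4)) +
  geom_point(size = 4) +
  scale_color_manual(values = cluster_cols)
```

This stays readable for a handful of runs, but the vector has to be maintained by hand, which is what the function below tries to avoid.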

Here is one way to match the colors and apply them sequentially to multiple clusterings:

library(ggplot2)
x <- c(.35, .35, .37, .5, .55, .56, .9, .91, .89)
y <- c(.35, .36, .35, .22, .27, .25, .88, .9, .87)
clu3 <- factor(c(31, 31, 31, 32, 32, 32, 33, 33, 33))
clu4 <- factor(c(41, 41, 41, 42, 43, 43, 44, 44, 44))
clu5 <- factor(c(51, 51, 52, 53, 54, 54, 55, 55, 55)) # added a few more clusters
clu6 <- factor(c(61, 61, 62, 63, 64, 64, 65, 66, 65))
df <- data.frame(x, y, clu3, clu4, clu5, clu6)

## assign specific colors to matching clusters; rest: use same colors
matchCol <- function(fac1, fac2, pal=c("#999999", "#E69F00", "#56B4E9",
                                       "#009E73", "#F0E442", "#0072B2",
                                       "#D55E00", "#CC79A7")){
    maxl <- max(length(levels(fac1)), length(levels(fac2)))
    if(length(pal) < maxl) { # make sure you have enough colors
        warning("Not enough colors; using scales::hue_pal")
        pal <- scales::hue_pal()(maxl)
    }
    tab <- as.matrix(table(fac1, fac2)) > 0
    rs1 <- which(rowSums(tab) == 1)
    rs2 <- apply(tab[rs1, , drop=FALSE], 1, which.max)
    f1 <- setNames(pal[seq_along(levels(fac1))], levels(fac1))
    f2 <- setNames(NA[seq_along(levels(fac2))], levels(fac2))
    f2[levels(fac2)[rs2]] <- f1[levels(fac1)[rs1]]              # add matching colors
    f2n <- names(f2)
    if(!identical(fac1, fac2)) f2n[rs2] <- paste0(levels(fac1)[rs1], " | ", levels(fac2)[rs2])
    f2[is.na(f2)] <- setdiff(pal, f2)[seq_along(f2[is.na(f2)])] # fill in remaining colors
    list(fac1=f1, fac2=f2, f2n=f2n )     # fac2: colors for fac2's levels; f2n: legend labels; fac1 is not used below
}

# then plot using matchCol function, e.g.:
ggplot(df, aes(x=x, y=y, color=clu4)) + 
    geom_point(size=4)+
    theme_bw()+
    ggtitle(paste("Four-cluster scatterplot"))+
    theme(plot.title = element_text(hjust = 0.5)) + 
    scale_color_manual(values=matchCol(clu3, clu4)$fac2,
                       labels=matchCol(clu3, clu4)$f2n) # spell out f2n; `$f2` only resolves to it via partial matching

# or generalized
clusts <- grep("clu", colnames(df), value=TRUE)
p1 <- lapply(clusts, function(z){
    # take the columns from df rather than relying on global copies via get()
    mc <- matchCol(df[[clusts[1]]], df[[z]]) 
    # note: aes_string() is deprecated in recent ggplot2 releases but still works
    ggplot(df, aes_string(x="x", y="y", color=z)) + 
        geom_point(size=4)+
        theme_bw()+
        ggtitle(paste0(gsub("clu", "", z),"-cluster scatterplot"))+
        theme(plot.title = element_text(hjust = 0.5)) + 
        scale_color_manual(values=mc$fac2, labels=mc$f2n)
    }
)
cowplot::plot_grid(plotlist = p1)

# same, relative to previous clustering:
p2 <- lapply(seq_along(clusts), function(z){
    # compare each clustering with the one before it (the first with itself)
    mc <- matchCol(df[[clusts[max(1, z-1)]]], df[[clusts[z]]])
    ggplot(df, aes_string(x="x", y="y", color=clusts[z])) + 
        geom_point(size=4)+
        theme_bw()+
        ggtitle(paste0(gsub("clu", "", clusts[z]),"-cluster scatterplot"))+
        theme(plot.title = element_text(hjust = 0.5)) + 
        scale_color_manual(values=mc$fac2, labels=mc$f2n)
    }
)
    
cowplot::plot_grid(plotlist = p2)

Created on 2020-12-17 by the reprex package (v0.3.0)
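To make the matching step inside matchCol easier to follow, the cross-tabulation can be run on its own. The sketch below repeats only that part of the function for clu3 and clu4; the variable names matched, partner, and pairs are introduced here for illustration. A level of the first clustering counts as matched when all of its points fall into exactly one level of the second.

```r
clu3 <- factor(c(31, 31, 31, 32, 32, 32, 33, 33, 33))
clu4 <- factor(c(41, 41, 41, 42, 43, 43, 44, 44, 44))

# TRUE where a clu3 level and a clu4 level share at least one point
tab <- as.matrix(table(clu3, clu4)) > 0

# clu3 levels whose points all land in a single clu4 level
matched <- which(rowSums(tab) == 1)

# column index of that single clu4 level, per matched row
partner <- apply(tab[matched, , drop = FALSE], 1, which.max)

pairs <- data.frame(from = levels(clu3)[matched],
                    to   = levels(clu4)[partner],
                    stringsAsFactors = FALSE)
pairs
#   from to
# 1   31 41
# 2   33 44
```

Level 32 is left out because its points are split between 42 and 43. Note that this check looks at rows only, so it does not guard against two levels of the first clustering collapsing into the same level of the second.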
