How to make the levels of a factor in a data frame consistent across all columns?

I have a data frame with 5 columns:

         Test1   Test2   Test3  Test4  Test5 
Sample1  PASS    PASS    FAIL    WARN   WARN
Sample2  PASS    PASS    FAIL    PASS   WARN
Sample3  PASS    FAIL    FAIL    PASS   WARN
Sample4  PASS    FAIL    FAIL    PASS   WARN
Sample5  PASS    WARN    FAIL    WARN   WARN

In each column, the levels are assigned different integer codes. In column 1, "PASS" is 1. In column 2, "PASS" is 2 and "FAIL" is 1. In column 3, "FAIL" is 1. In column 4, "PASS" is 1 and "WARN" is 2. In column 5, "WARN" is 1.

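The codes follow alphabetical order within each column. I need "PASS" to be 1 in all columns, "WARN" to be 2 in all columns, and "FAIL" to be 3 in all columns, so that I can convert the data frame to a matrix and plot it as a heatmap.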

Currently, the codes are assigned based on which values appear in a particular column, in alphabetical order.
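To make the behaviour concrete, here is a minimal sketch using two of the columns from the table as plain vectors. The names `test2` and `test4` are only for illustration.

```r
# Each vector is converted on its own, so the levels come from the values
# present in that vector, sorted alphabetically.
test2 <- factor(c("PASS", "PASS", "FAIL", "FAIL", "WARN"))
test4 <- factor(c("WARN", "PASS", "PASS", "PASS", "WARN"))

levels(test2)      # "FAIL" "PASS" "WARN"
levels(test4)      # "PASS" "WARN"

# The same label therefore maps to different integer codes.
as.integer(test2)  # 2 2 1 1 3
as.integer(test4)  # 2 1 1 1 2
```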

How can I keep the mapping the same across the whole data frame?

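You can loop over the columns of the dataset 'df' with lapply, convert each one to a factor again with the levels specified in the order you want, and assign the result back to the corresponding columns.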

lvls <- c('PASS', 'WARN', 'FAIL')
df[] <-  lapply(df, factor, levels=lvls)
str(df)
# 'data.frame': 5 obs. of  5 variables:
# $ Test1: Factor w/ 3 levels "PASS","WARN",..: 1 1 1 1 1
# $ Test2: Factor w/ 3 levels "PASS","WARN",..: 1 1 3 3 2
# $ Test3: Factor w/ 3 levels "PASS","WARN",..: 3 3 3 3 3
# $ Test4: Factor w/ 3 levels "PASS","WARN",..: 2 1 1 1 2
# $ Test5: Factor w/ 3 levels "PASS","WARN",..: 2 2 2 2 2
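Since the goal in the question is a matrix for a heatmap, here is a sketch of that last step. It rebuilds the data from the table so it runs on its own; the colour choice and the use of base `heatmap()` are assumptions, not something the question specifies.

```r
# Rebuild the example so the snippet is self-contained (values from the question).
df <- data.frame(
  Test1 = c("PASS", "PASS", "PASS", "PASS", "PASS"),
  Test2 = c("PASS", "PASS", "FAIL", "FAIL", "WARN"),
  Test3 = c("FAIL", "FAIL", "FAIL", "FAIL", "FAIL"),
  Test4 = c("WARN", "PASS", "PASS", "PASS", "WARN"),
  Test5 = c("WARN", "WARN", "WARN", "WARN", "WARN"),
  row.names = paste0("Sample", 1:5),
  stringsAsFactors = TRUE
)
lvls <- c("PASS", "WARN", "FAIL")
df[] <- lapply(df, factor, levels = lvls)

# data.matrix() replaces each factor by its integer codes and keeps the dimnames.
m <- data.matrix(df)
m["Sample3", "Test2"]  # 3, i.e. "FAIL"

# Base-graphics heatmap without clustering or rescaling (colours are illustrative).
heatmap(m, Rowv = NA, Colv = NA, scale = "none",
        col = c("green", "orange", "red"))
```

Because every column now shares the same levels, a code of 1, 2 or 3 means the same thing in every cell of `m`.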

If you prefer to use data.table:

library(data.table)
setDT(df)[, names(df):= lapply(.SD, factor, levels=lvls)]

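setDT converts the 'data.frame' to a 'data.table' by reference. The columns named by names(df) are then assigned (:=) the re-leveled factor columns returned by lapply(..). .SD stands for 'Subset of Data.table'.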
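One thing to watch with this route: a data.table has no row names, so the Sample labels are dropped unless you ask setDT to keep them. A small sketch, assuming a cut-down two-column version of the data and that you want the labels kept in a column:

```r
library(data.table)

# Cut-down stand-in for the question's data (two samples, two tests).
df <- data.frame(
  Test1 = c("PASS", "PASS"),
  Test2 = c("FAIL", "WARN"),
  row.names = c("Sample1", "Sample2"),
  stringsAsFactors = TRUE
)
lvls <- c("PASS", "WARN", "FAIL")

# keep.rownames = TRUE copies the row names into a column called "rn";
# without it, setDT() drops them.
setDT(df, keep.rownames = TRUE)

# Re-level only the test columns; .SDcols restricts .SD to those columns.
cols <- setdiff(names(df), "rn")
df[, (cols) := lapply(.SD, factor, levels = lvls), .SDcols = cols]

sapply(df[, ..cols], levels)
```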

Data

df <- structure(list(Test1 = structure(c(1L, 1L, 1L, 1L, 1L), 
.Label = "PASS", class = "factor"), 
  Test2 = structure(c(2L, 2L, 1L, 1L, 3L), .Label = c("FAIL", 
 "PASS", "WARN"), class = "factor"), Test3 = structure(c(1L, 
 1L, 1L, 1L, 1L), .Label = "FAIL", class = "factor"), Test4 = 
 structure(c(2L, 1L, 1L, 1L, 2L), .Label = c("PASS", "WARN", "FAIL"), 
 class = "factor"), Test5 = structure(c(1L, 1L, 1L, 1L, 1L), .Label = 
"WARN", class = "factor")), .Names = c("Test1", 
"Test2", "Test3", "Test4", "Test5"), row.names = c("Sample1", 
"Sample2", "Sample3", "Sample4", "Sample5"), class = "data.frame")

Using dplyr:

library(dplyr)
# mutate_each() and funs() are deprecated; across() needs dplyr >= 1.0.0
df <- df %>% mutate(across(everything(), ~ factor(.x, levels = c('PASS', 'WARN', 'FAIL'))))

This gives:

#> str(df)
#'data.frame':  5 obs. of  5 variables:
# $ Test1: Factor w/ 3 levels "PASS","WARN",..: 1 1 1 1 1
# $ Test2: Factor w/ 3 levels "PASS","WARN",..: 1 1 3 3 2
# $ Test3: Factor w/ 3 levels "PASS","WARN",..: 3 3 3 3 3
# $ Test4: Factor w/ 3 levels "PASS","WARN",..: 2 1 1 1 2
# $ Test5: Factor w/ 3 levels "PASS","WARN",..: 2 2 2 2 2

A more general approach, assuming your data.frame may contain other string values as well as NA:

library(magrittr)

fac = df %>% as.matrix %>% as.vector %>% unique
df1 = data.frame(lapply(df, factor, levels = fac[!is.na(fac)]))
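The same idea can be written in base R without magrittr. This is only a sketch: the small data frame and the extra value "SKIP" are made up for illustration, and the levels end up in order of first appearance rather than in a chosen order.

```r
# Hypothetical data with an NA and an extra value ("SKIP") not in the question.
df <- data.frame(
  Test1 = c("PASS", "PASS", NA),
  Test2 = c("FAIL", "SKIP", "WARN"),
  stringsAsFactors = TRUE
)

# Collect every distinct value across all columns, dropping NA.
vals <- unique(unlist(lapply(df, as.character)))
vals <- vals[!is.na(vals)]
vals  # "PASS" "FAIL" "SKIP" "WARN"

# Apply the shared levels to every column; NA entries stay NA.
df1 <- df
df1[] <- lapply(df, factor, levels = vals)
```

Assigning into `df1[]` keeps the original row names, whereas `data.frame(lapply(...))` builds a new data frame with default ones. If a specific order matters, as with PASS, WARN, FAIL in the question, reorder `vals` by hand before the last step.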