How to make the levels of a factor in a data frame consistent across all columns?
I have a data frame with 5 different columns:
        Test1 Test2 Test3 Test4 Test5
Sample1  PASS  PASS  FAIL  WARN  WARN
Sample2  PASS  PASS  FAIL  PASS  WARN
Sample3  PASS  FAIL  FAIL  PASS  WARN
Sample4  PASS  FAIL  FAIL  PASS  WARN
Sample5  PASS  WARN  FAIL  WARN  WARN
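Within each column, the factor levels are assigned different integer codes.
In column 1, "PASS" is 1.
In column 2, "PASS" is 2 and "FAIL" is 1.
In column 3, "FAIL" is 1.
In column 4, "PASS" is 1 and "WARN" is 2.
In column 5, "WARN" is 1.
I need "PASS" to be 1 in all columns, "WARN" to be 2 in all columns, and "FAIL" to be 3 in all columns, so that I can convert the data frame to a matrix and plot it as a heatmap.
Currently, the levels are assigned based on which values appear in a particular column, in alphabetical order.
How can I keep the assignment the same across the entire data frame?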
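For anyone reproducing this, here is a minimal sketch of why the codes differ. The two vectors are illustrative and mirror the Test2 and Test4 columns. By default, factor() builds its levels from the sorted unique values present in each vector, so the same label can end up with a different integer code in each column.

```r
# Two columns created independently, each with default levels
test2 <- factor(c("PASS", "PASS", "FAIL", "FAIL", "WARN"))
test4 <- factor(c("WARN", "PASS", "PASS", "PASS", "WARN"))

levels(test2)      # "FAIL" "PASS" "WARN"
levels(test4)      # "PASS" "WARN"

# "WARN" is code 3 in test2 but code 2 in test4
as.integer(test2)  # 2 2 1 1 3
as.integer(test4)  # 2 1 1 1 2
```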
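You can loop over the columns of the dataset 'df' with lapply, convert each column to a factor again with the levels specified explicitly so that every column uses the same order, and assign the result back to the corresponding columns.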
lvls <- c('PASS', 'WARN', 'FAIL')
df[] <- lapply(df, factor, levels=lvls)
str(df)
# 'data.frame': 5 obs. of 5 variables:
# $ Test1: Factor w/ 3 levels "PASS","WARN",..: 1 1 1 1 1
# $ Test2: Factor w/ 3 levels "PASS","WARN",..: 1 1 3 3 2
# $ Test3: Factor w/ 3 levels "PASS","WARN",..: 3 3 3 3 3
# $ Test4: Factor w/ 3 levels "PASS","WARN",..: 2 1 1 1 2
# $ Test5: Factor w/ 3 levels "PASS","WARN",..: 2 2 2 2 2
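Since the stated goal is a matrix for a heatmap, the following is a rough sketch of that last step, assuming the data frame is rebuilt from the table in the question. data.matrix() is one base R option that turns factor columns into their integer codes and keeps the row and column names. The heatmap() call is left as a comment and is only a suggestion.

```r
# Rebuild the example data (values taken from the table in the question)
df <- data.frame(
  Test1 = c("PASS", "PASS", "PASS", "PASS", "PASS"),
  Test2 = c("PASS", "PASS", "FAIL", "FAIL", "WARN"),
  Test3 = c("FAIL", "FAIL", "FAIL", "FAIL", "FAIL"),
  Test4 = c("WARN", "PASS", "PASS", "PASS", "WARN"),
  Test5 = c("WARN", "WARN", "WARN", "WARN", "WARN"),
  row.names = paste0("Sample", 1:5),
  stringsAsFactors = TRUE
)

# Give every column the same level order
lvls <- c("PASS", "WARN", "FAIL")
df[] <- lapply(df, factor, levels = lvls)

# Factors become their integer codes: PASS = 1, WARN = 2, FAIL = 3
m <- data.matrix(df)
m["Sample1", ]  # Test1 = 1, Test2 = 1, Test3 = 3, Test4 = 2, Test5 = 2

# One possible way to plot it, without clustering or scaling:
# heatmap(m, Rowv = NA, Colv = NA, scale = "none")
```

Because the codes now mean the same thing in every column, a single colour scale can be used for the whole matrix.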
If you prefer to use data.table:
library(data.table)
setDT(df)[, names(df):= lapply(.SD, factor, levels=lvls)]
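setDT converts the 'data.frame' to a 'data.table' by reference. The := operator assigns the re-converted factor columns (lapply(..)) back to the columns named by names(df). .SD stands for "Subset of Data.table".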
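One thing to watch for: a data.table has no row names, so the sample labels are lost in the conversion unless they are stored in a column. Below is a small sketch of one way to keep them, assuming a cut-down version of the example data. The column name "Sample" and the variable cols are arbitrary choices for this illustration. It uses as.data.table(), which makes a copy, instead of setDT(), which converts in place.

```r
library(data.table)

# Cut-down example data with sample names as row names
df <- data.frame(
  Test1 = c("PASS", "PASS", "PASS"),
  Test2 = c("PASS", "FAIL", "WARN"),
  row.names = c("Sample1", "Sample2", "Sample3"),
  stringsAsFactors = TRUE
)
lvls <- c("PASS", "WARN", "FAIL")

# Store the row names in a column called "Sample" (name chosen for this example)
dt <- as.data.table(df, keep.rownames = "Sample")

# Re-level only the test columns, leaving the "Sample" column alone
cols <- setdiff(names(dt), "Sample")
dt[, (cols) := lapply(.SD, factor, levels = lvls), .SDcols = cols]
```

Wrapping cols in parentheses on the left of := tells data.table to use the column names stored in the variable, and .SDcols restricts .SD to those same columns.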
Data
df <- structure(list(Test1 = structure(c(1L, 1L, 1L, 1L, 1L),
.Label = "PASS", class = "factor"),
Test2 = structure(c(2L, 2L, 1L, 1L, 3L), .Label = c("FAIL",
"PASS", "WARN"), class = "factor"), Test3 = structure(c(1L,
1L, 1L, 1L, 1L), .Label = "FAIL", class = "factor"), Test4 =
structure(c(2L, 1L, 1L, 1L, 2L), .Label = c("PASS", "WARN", "FAIL"),
class = "factor"), Test5 = structure(c(1L, 1L, 1L, 1L, 1L), .Label =
"WARN", class = "factor")), .Names = c("Test1",
"Test2", "Test3", "Test4", "Test5"), row.names = c("Sample1",
"Sample2", "Sample3", "Sample4", "Sample5"), class = "data.frame")
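Using dplyr:
library(dplyr)
# mutate_each() and funs() are deprecated in current dplyr; across() replaces them
df <- df %>% mutate(across(everything(), ~ factor(.x, levels = c('PASS', 'WARN', 'FAIL'))))
You get: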
#> str(df)
#'data.frame': 5 obs. of 5 variables:
# $ Test1: Factor w/ 3 levels "PASS","WARN",..: 1 1 1 1 1
# $ Test2: Factor w/ 3 levels "PASS","WARN",..: 1 1 3 3 2
# $ Test3: Factor w/ 3 levels "PASS","WARN",..: 3 3 3 3 3
# $ Test4: Factor w/ 3 levels "PASS","WARN",..: 2 1 1 1 2
# $ Test5: Factor w/ 3 levels "PASS","WARN",..: 2 2 2 2 2
A more general approach, assuming the data.frame may contain other string values as well as NA:
library(magrittr)
fac = df %>% as.matrix %>% as.vector %>% unique
df1 = data.frame(lapply(df, factor, levels = fac[!is.na(fac)]))
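Note that with this approach the level order follows the order in which values first appear, which is not necessarily PASS, WARN, FAIL. As a possible variation in base R, the known levels can be put first and any other values appended after them. The example data, the extra "SKIP" value, and the names preferred and seen are made up for this illustration.

```r
# Made-up data with an unexpected value ("SKIP") and a missing value
df <- data.frame(
  Test1 = c("PASS", "PASS", NA),
  Test2 = c("FAIL", "SKIP", "WARN"),
  stringsAsFactors = TRUE
)

# Levels whose order matters, listed first
preferred <- c("PASS", "WARN", "FAIL")

# Every distinct non-missing value that occurs anywhere in the data frame
seen <- unique(unlist(lapply(df, as.character)))
seen <- seen[!is.na(seen)]

# Preferred levels first, then anything else in order of appearance
lvls <- union(preferred, seen)  # "PASS" "WARN" "FAIL" "SKIP"

df[] <- lapply(df, factor, levels = lvls)
as.integer(df$Test2)  # 3 4 2
```

NA values stay NA and do not become a level.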