How to relevel a factor that combines two levels with "&"

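My data contains an unexpected factor level that combines two levels with "&": "intermediate 7 & 8".

What is the best way to relevel this value? Similar combinations may show up in the future, for example "Beginner 3 & 4".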

#Relevel factors
Sample <- as.factor(c("Beginner 1","intermediate 8", "intermediate 7 & 8", 
                     "Expert 2","Expert 10","Beginner 3 & 4","Beginner 5",
                     "Beginner 10", "intermediate 1", "Expert 1", NA))
newLevel <- factor(c("NA", paste0("Beginner ", 1:10), paste0("intermediate ", 1:10), 
                   paste0("Expert ", 1:10)))
newSample <- factor(Sample, levels=newLevel)

newSample
# [1] Beginner 1     intermediate 8 <NA>           Expert 2       Expert 10     
# [6] <NA>           Beginner 5     Beginner 10    intermediate 1 Expert 1      
# [11] <NA>          
#   31 Levels: NA Beginner 1 Beginner 2 Beginner 3 Beginner 4 Beginner 5 ... Expert 10

#Change factor to Numeric
SampleNum <- as.numeric(factor(Sample, levels=newLevel))
SampleNum
# [1]  2 19 NA 23 31 NA  6 11 12 22 NA

So "intermediate 7 & 8" is treated as NA, but it should fall between "intermediate 7" and "intermediate 8".

Is there a good way to break it apart so that it can be converted to a number?

To get a quasi-numeric suffix, you can extract the numbers from each level and take their mean when two occur.

# Replace non-digit runs with a space, split, and average the numbers per level
suffix <- sapply(strsplit(trimws(gsub("\\D+", " ", levels(Sample))), " "), function(x) 
  mean(as.numeric(x)))
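
To make the intermediate steps visible, here is a minimal sketch that runs the same pipeline on two labels. The names `labs`, `step1`, `step2` and `suffix_demo` are made up for this illustration and are not part of the answer's code.

```r
# Hypothetical labels, used only to illustrate the intermediate steps.
labs <- c("intermediate 8", "intermediate 7 & 8")

# 1. Replace every run of non-digit characters with a single space.
step1 <- gsub("\\D+", " ", labs)        # " 8"   " 7 8"

# 2. Trim the leading space, then split on the remaining spaces.
step2 <- strsplit(trimws(step1), " ")   # list("8", c("7", "8"))

# 3. Average the numbers found in each label.
suffix_demo <- sapply(step2, function(x) mean(as.numeric(x)))
suffix_demo
# [1] 8.0 7.5
```

A label with a single number keeps that number, while a combined label such as "intermediate 7 & 8" lands halfway between its two parts.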

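Then, to get the prefixes, use cat.df as a lookup table that maps the categories to larger numbers in the desired order. The multiples of 100 keep the categories apart once the suffix is added.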

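# Lookup table: category name in column 1, category weight in column 2
cat.df <- data.frame(c("Beginner", "intermediate", "Expert"),
                      (1:3)*100)
# Keep the text before the first whitespace, then look up its weight
prefix <- sapply(gsub("(\\D+)\\s.*", "\\1", levels(Sample)), function(x, y) 
  cat.df[match(x, y), 2], cat.df[, 1])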
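
The same lookup can also be written with a named vector instead of a data frame. This is only a sketch of an alternative, not the code used in this answer; `labs`, `cat_weight`, `cat_name` and `prefix_demo` are hypothetical names. As with cat.df, it assumes the numeric suffix always stays below 100, so categories cannot overlap.

```r
# Hypothetical labels for illustration.
labs <- c("Beginner 1", "intermediate 7 & 8", "Expert 10")

# Named vector: category name -> weight (assumes every suffix is below 100).
cat_weight <- c(Beginner = 100, intermediate = 200, Expert = 300)

# Drop everything from the first whitespace on to keep only the category name.
cat_name <- sub("\\s.*", "", labs)

# Index the named vector by category name to get one weight per label.
prefix_demo <- unname(cat_weight[cat_name])
prefix_demo
# [1] 100 200 300
```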

That is all that is needed to relevel the Sample vector.

new.Sample <- factor(Sample, levels=levels(Sample)[order(prefix + suffix)])
new.Sample
#  [1] Beginner 1         intermediate 8     intermediate 7 & 8 Expert 2          
#  [5] Expert 10          Beginner 3 & 4     Beginner 5         Beginner 10       
#  [9] intermediate 1     Expert 1           <NA>              
# 10 Levels: Beginner 1 Beginner 3 & 4 Beginner 5 Beginner 10 ... Expert 10
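
Since more combined levels such as "Beginner 3 & 4" may appear later, the steps can be wrapped in a small helper. The sketch below is one possible way to do that; the function name `relevel_by_rank` and its `categories` argument are assumptions made for this example. It sorts with `order(cats, nums)` on two keys rather than adding a weight of 100, so it does not depend on the suffix staying below 100.

```r
# Hypothetical helper: reorder the levels of `f` by category, then by the
# mean of the numbers found in each level label.
relevel_by_rank <- function(f,
                            categories = c("Beginner", "intermediate", "Expert")) {
  labs <- levels(f)
  # Mean of all numbers in each label ("7 & 8" becomes 7.5).
  nums <- sapply(strsplit(trimws(gsub("\\D+", " ", labs)), " "),
                 function(x) mean(as.numeric(x)))
  # Position of each label's category in the desired category order.
  cats <- match(sub("\\s.*", "", labs), categories)
  factor(f, levels = labs[order(cats, nums)])
}

# Made-up input with two combined levels.
x <- factor(c("Expert 2", "Beginner 3 & 4", "Beginner 3",
              "intermediate 7 & 8", "intermediate 8", "Beginner 4"))
res <- relevel_by_rank(x)
levels(res)
```

With this input, "Beginner 3 & 4" ends up between "Beginner 3" and "Beginner 4", and "intermediate 7 & 8" comes directly before "intermediate 8".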

Check

data.frame(sort(new.Sample), as.numeric(sort(new.Sample)))
#      sort.new.Sample. as.numeric.sort.new.Sample..
# 1          Beginner 1                            1
# 2      Beginner 3 & 4                            2
# 3          Beginner 5                            3
# 4         Beginner 10                            4
# 5      intermediate 1                            5
# 6  intermediate 7 & 8                            6
# 7      intermediate 8                            7
# 8            Expert 1                            8
# 9            Expert 2                            9
# 10          Expert 10                           10

Convert to numeric

as.numeric(new.Sample)
# [1]  1  7  6  9 10  2  3  4  5  8 NA
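
Note that these integer codes are ranks among the levels that happen to be present. If a value on a fixed scale is wanted instead, so that "intermediate 7 & 8" is numerically halfway between "intermediate 7" and "intermediate 8", one option is to compute a score per level and index it by the factor's integer codes. The sketch below assumes 10 steps per category, as in the question's `paste0(..., 1:10)` levels; `Sample2`, `categories`, `level_score` and `score` are names invented for this example.

```r
# Made-up sample for illustration.
Sample2 <- factor(c("Beginner 1", "intermediate 8", "intermediate 7 & 8",
                    "Beginner 3 & 4", "Expert 10", NA))
categories <- c("Beginner", "intermediate", "Expert")

labs <- levels(Sample2)
# Mean of the numbers in each level label.
num <- sapply(strsplit(trimws(gsub("\\D+", " ", labs)), " "),
              function(x) mean(as.numeric(x)))
# Category position: 1 = Beginner, 2 = intermediate, 3 = Expert.
cat_idx <- match(sub("\\s.*", "", labs), categories)

# Assumption: each category spans 10 steps, so the scale runs from 1 to 30.
level_score <- (cat_idx - 1) * 10 + num

# Map the per-level score back to every observation; NA stays NA.
score <- level_score[as.integer(Sample2)]
score
# values: 1.0 18.0 17.5 3.5 30.0 NA
```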

Data

Sample <- structure(c(1L, 10L, 9L, 7L, 6L, 3L, 4L, 2L, 8L, 5L, NA), .Label = c("Beginner 1", 
"Beginner 10", "Beginner 3 & 4", "Beginner 5", "Expert 1", "Expert 10", 
"Expert 2", "intermediate 1", "intermediate 7 & 8", "intermediate 8"
), class = "factor")