How to automate adding factors to variables in a large data frame in R

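I have a large data frame in R with more than 200 variables, most of them character, that I would like to convert to factors. I have prepared all the levels and labels in a separate data frame. For a given variable Var1, the corresponding levels and labels are named Var1_v and Var1_b; for example, for the variable Gender, the levels and labels are named Gender_v and Gender_b.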

Here is a sample of my data:

df <- data.frame (Gender = c("2","2","1","2"),
                  AgeG = c("3","1","4","2"))

fct <- data.frame (Gender_v  = c("1", "2"),
                  Gender_b = c("Male", "Female"),
                  AgeG_v = c("1","2","3","4"),
                  AgeG_b = c("<25","25-60","65-80",">80"))

df$Gender <- factor(df$Gender, levels = fct$Gender_v, labels = fct$Gender_b, exclude = NULL)
df$AgeG <- factor(df$AgeG, levels = fct$AgeG_v, labels = fct$AgeG_b, exclude = NULL)

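Is there a way to automate the process so that the factors (levels and labels) are applied to the corresponding variables without my having to handle each variable individually? I think it could be done with a function using pmap.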

My goal is to minimize the effort this process requires. Is there a better way to prepare the labels and levels?

Thank you very much for your help.

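I solved this with a simple refactoring of your code, automating it with a loop. The more data you add, the longer it will take. I believe fct[[paste0(names(df[i]),"_v")]] could be refactored into a small function to make it look nicer.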

> df <- data.frame (Gender = c("2","2","1","2"),
+                   AgeG = c("3","1","4","2"))
> 
> fct <- data.frame (Gender_v  = c("1", "2"),
+                    Gender_b = c("Male", "Female"),
+                    AgeG_v = c("1","2","3","4"),
+                    AgeG_b = c("<25","25-60","65-80",">80"))
> 
> for(i in 1:ncol(df)){
+   
+   le <- fct[[paste0(names(df[i]),"_v")]]
+   
+   la <- fct[[paste0(names(df[i]),"_b")]]
+   
+   df[,i] <- factor(df[,i],levels = le ,labels = la,exclude = NULL)
+   
+ }
> 
> df
  Gender  AgeG
1 Female 65-80
2 Female   <25
3   Male   >80
4 Female 25-60
>
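The lookup mentioned above could be pulled out into a small helper. The following is a minimal sketch, assuming the `_v`/`_b` naming convention from the question; `get_spec` and `apply_factors` are hypothetical names, not part of the original code:

```r
df <- data.frame(Gender = c("2", "2", "1", "2"),
                 AgeG   = c("3", "1", "4", "2"))

fct <- data.frame(Gender_v = c("1", "2"),
                  Gender_b = c("Male", "Female"),
                  AgeG_v   = c("1", "2", "3", "4"),
                  AgeG_b   = c("<25", "25-60", "65-80", ">80"))

# Hypothetical helper: fetch the levels ("_v") and labels ("_b") for one variable.
# data.frame() recycles shorter columns, so repeated level/label pairs are dropped.
get_spec <- function(fct, var) {
  spec <- data.frame(levels = fct[[paste0(var, "_v")]],
                     labels = fct[[paste0(var, "_b")]])
  unique(spec)
}

# Hypothetical wrapper: convert every column of `data` that has both a
# "_v" and a "_b" column in `fct`; other columns are left untouched.
apply_factors <- function(data, fct) {
  for (var in names(data)) {
    if (all(paste0(var, c("_v", "_b")) %in% names(fct))) {
      spec <- get_spec(fct, var)
      data[[var]] <- factor(data[[var]],
                            levels = spec$levels,
                            labels = spec$labels,
                            exclude = NULL)
    }
  }
  data
}

df <- apply_factors(df, fct)
levels(df$Gender)
```

Looping over column names rather than positions keeps the lookup readable and lets the function skip columns that have no definition in `fct`.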

Edit: here is the version with an added if condition, so that only columns whose names end in _f are converted:


> library(stringr)  # provides str_remove()
> 
> df <- data.frame (Gender_f = c("2","2","1","2"),
+                   AgeG_f = c("3","1","4","2"),
+                   AgeN = c(70,15,96,30))
> 
> fct <- data.frame (Gender_v  = c("1", "2"),
+                    Gender_b = c("Male", "Female"),
+                    AgeG_v = c("1","2","3","4"),
+                    AgeG_b = c("<25","25-60","65-80",">80"))
> 
> for(i in 1:ncol(df)){
+ 
+   # only convert columns whose name ends in "_f"
+   if(endsWith(names(df[i]),"_f")){
+     
+     # strip the trailing "_f" to get the base name used in fct
+     name <- str_remove(names(df[i]),"_f$")
+   
+     le <- fct[[paste0(name,"_v")]]
+    
+     la <- fct[[paste0(name,"_b")]]
+      
+     df[,i] <- factor(df[,i],levels = le ,labels = la,exclude = NULL)
+   
+   }
+      
+ }
> 
> df
  Gender_f AgeG_f AgeN
1   Female  65-80   70
2   Female    <25   15
3     Male    >80   96
4   Female  25-60   30
> 
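The suffix handling can also be done with base R's `sub()`, which avoids the stringr dependency. This is a minimal sketch, assuming the `_f` suffix convention used above; skipping columns that have no matching entry in `fct` is an added assumption, not something the original loop does:

```r
df <- data.frame(Gender_f = c("2", "2", "1", "2"),
                 AgeG_f   = c("3", "1", "4", "2"),
                 AgeN     = c(70, 15, 96, 30))

fct <- data.frame(Gender_v = c("1", "2"),
                  Gender_b = c("Male", "Female"),
                  AgeG_v   = c("1", "2", "3", "4"),
                  AgeG_b   = c("<25", "25-60", "65-80", ">80"))

# Columns flagged for conversion by the "_f" suffix
flagged <- grep("_f$", names(df), value = TRUE)

for (col in flagged) {
  base_name <- sub("_f$", "", col)        # e.g. "Gender_f" -> "Gender"
  le <- fct[[paste0(base_name, "_v")]]    # NULL if the column does not exist
  la <- fct[[paste0(base_name, "_b")]]
  if (is.null(le) || is.null(la)) next    # assumption: leave unmatched columns as-is
  df[[col]] <- factor(df[[col]], levels = le, labels = la, exclude = NULL)
}

str(df)
```

Selecting the flagged columns up front means the numeric column `AgeN` is never touched by the loop.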

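A data frame isn't really a suitable data structure for storing factor level definitions: there is no reason to expect all factors to have the same number of levels. Instead, I'd just use a simple list and store the level information more compactly as named vectors, like this: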

df <- data.frame(
  Gender = c("2", "2", "1", "2"),
  AgeG = c("3", "1", "4", "2")
)

value_labels <- list(
  Gender = c("Male" = 1, "Female" = 2),
  AgeG = c("<25" = 1, "25-60" = 2, "65-80" = 3, ">80" = 4)
)

Then you can create a function that uses that data structure to generate the factors in the data frame:

make_factors <- function(data, value_labels) {
  for (var in names(value_labels)) {
    if (var %in% colnames(data)) {
      vl <- value_labels[[var]]
      data[[var]] <- factor(
        data[[var]],
        levels = unname(vl),
        labels = names(vl)
      )
    }
  }
  data
}

make_factors(df, value_labels)
#>   Gender  AgeG
#> 1 Female 65-80
#> 2 Female   <25
#> 3   Male   >80
#> 4 Female 25-60
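
If the level definitions already exist in the wide `fct` layout from the question, they could be converted into this list form instead of being retyped. The following is a rough sketch, assuming the `_v`/`_b` suffixes; `fct_to_list` is a hypothetical helper name:

```r
fct <- data.frame(Gender_v = c("1", "2"),
                  Gender_b = c("Male", "Female"),
                  AgeG_v   = c("1", "2", "3", "4"),
                  AgeG_b   = c("<25", "25-60", "65-80", ">80"))

# Hypothetical helper: build a list of named vectors (label = level) from `fct`.
fct_to_list <- function(fct) {
  vars <- sub("_v$", "", grep("_v$", names(fct), value = TRUE))
  out <- lapply(vars, function(var) {
    le <- as.character(fct[[paste0(var, "_v")]])
    la <- as.character(fct[[paste0(var, "_b")]])
    # drop rows that data.frame() recycled, plus any NA padding
    keep <- !duplicated(le) & !is.na(le)
    setNames(le[keep], la[keep])
  })
  setNames(out, vars)
}

value_labels <- fct_to_list(fct)
value_labels$Gender
```

The levels come out as character strings rather than numbers here; `factor()` matches values after converting them to character, so the result should work with `make_factors()` either way.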