R / using vectorization to check if columns exist in a df

Question

I defined the following function to check whether a data frame contains several columns and, if it does not, to add them.

CheckFullCohorts <- function(df) {
  # Checks if year/cohort df contains all necessary columns 
  # Args:
  #  df: year/cohort df

  # Return:
  #  df: df, corrected if necessary 

  foo <- function(mydf, mystring) {
    if(!(mystring %in% names(mydf))) {
      mydf[mystring] <- 0
    }
    mydf
  }

  df <- foo(df, "age.16.20")
  df <- foo(df, "age.21.24")
  df <- foo(df, "age.25.49")
  df <- foo(df, "age.50.57")
  df <- foo(df, "age.58.65")
  df <- foo(df, "age.66.70")

  df
}

I would use this function as follows:

test <- data.frame(age.16.20 = rep("x", 5), lorem = rep("y", 5))

test <- CheckFullCohorts(test)

Question: How can I make the hard-coded part of the function (df <- foo(...)) more flexible by using a vector of column names?

I tried:

CheckFullCohorts <- function(df, col.list) {
  # Checks if year/cohort df contains all necessary columns 
  # Args:
  #  df: year/cohort df
  #  col.list: named list of columns

  # Return:
  #  df: df, corrected if necessary 

  foo <- function(mydf, mystring) {
    if(!(mystring %in% names(mydf))) {
      mydf[mystring] <- 0
    }
    mydf
  }

  df <- sapply(df, foo, mystring = col.list) 

  df
}

...but I get the wrong result:

test <- data.frame(age.16.20 = rep("x", 5), lorem = rep("y", 5))
test <- CheckFullCohorts(test, c("age.16.20", "age.20.25"))

Warning messages:
1: In if (!(mystring %in% names(mydf))) { :
  the condition has length > 1 and only the first element will be used
2: In `[<-.factor`(`*tmp*`, mystring, value = 0) :
  invalid factor level, NA generated
3: In if (!(mystring %in% names(mydf))) { :
  the condition has length > 1 and only the first element will be used
4: In `[<-.factor`(`*tmp*`, mystring, value = 0) :
  invalid factor level, NA generated
> test
          age.16.20 lorem
          "x"       "y"  
          "x"       "y"  
          "x"       "y"  
          "x"       "y"  
          "x"       "y"  
age.16.20 NA        NA   
age.20.25 NA        NA

Answer 1

You can easily vectorize this:

test <- data.frame(age.16.20 = rep("x", 5), lorem = rep("y", 5))
musthaves <- c("age.16.20", "age.21.24", "age.25.49",
               "age.50.57", "age.58.65", "age.66.70")

test[musthaves[!(musthaves %in% names(test))]] <- 0
#  age.16.20 lorem age.21.24 age.25.49 age.50.57 age.58.65 age.66.70
#1         x     y         0         0         0         0         0
#2         x     y         0         0         0         0         0
#3         x     y         0         0         0         0         0
#4         x     y         0         0         0         0         0
#5         x     y         0         0         0         0         0
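
To get the function with a column-name argument that the question asks for, the one-liner above can be wrapped as in the sketch below. The name AddMissingCols and its fill argument are hypothetical, chosen for illustration only; setdiff() is used here as an equivalent of the %in% subsetting above.

```r
# Hypothetical helper (not from the original post): add every column
# named in col.names that df does not already have, filled with `fill`.
AddMissingCols <- function(df, col.names, fill = 0) {
  missing <- setdiff(col.names, names(df))  # required names not yet in df
  if (length(missing) > 0) {
    df[missing] <- fill                     # one assignment creates all of them
  }
  df
}

test <- data.frame(age.16.20 = rep("x", 5), lorem = rep("y", 5))
test <- AddMissingCols(test, c("age.16.20", "age.20.25"))
names(test)
# [1] "age.16.20" "lorem"     "age.20.25"
```

The sapply attempt in the question goes wrong because sapply iterates over the columns of df, so foo receives a single column (a factor) rather than the whole data frame, while mystring is the full vector of names. Working on the names directly avoids the loop altogether.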

However, NA values are usually more appropriate than 0.
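
A minimal sketch of the NA variant follows. It assumes the missing columns are meant to be numeric, which is why NA_real_ is used instead of a plain (logical) NA; the shortened musthaves vector is for illustration only.

```r
test <- data.frame(age.16.20 = rep("x", 5), lorem = rep("y", 5))
musthaves <- c("age.16.20", "age.21.24", "age.25.49")

# Fill the missing columns with numeric NA instead of 0
test[setdiff(musthaves, names(test))] <- NA_real_

# Count the NA cells per column to see which columns were added
colSums(is.na(test))
# age.16.20     lorem age.21.24 age.25.49
#         0         0         5         5
```

With NA, a column that was absent from the input stays distinguishable from one that genuinely contains zeros.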
