R / using vectorization to check if columns exist in a df
I defined the following function to check whether a data frame contains several columns and, if it does not, to add them.
CheckFullCohorts <- function(df) {
  # Checks if year/cohort df contains all necessary columns
  # Args:
  #   df: year/cohort df
  # Return:
  #   df: df, corrected if necessary
  foo <- function(mydf, mystring) {
    if(!(mystring %in% names(mydf))) {
      mydf[mystring] <- 0
    }
    mydf
  }
  df <- foo(df, "age.16.20")
  df <- foo(df, "age.21.24")
  df <- foo(df, "age.25.49")
  df <- foo(df, "age.50.57")
  df <- foo(df, "age.58.65")
  df <- foo(df, "age.66.70")
  df
}
I would use this function as follows:
test <- data.frame(age.16.20 = rep("x", 5), lorem = rep("y", 5))
test <- CheckFullCohorts(test)
Question: how can I make the hard-coded part of the function (df <- foo(...)) more flexible by using a vector of column names?
I tried:
CheckFullCohorts <- function(df, col.list) {
  # Checks if year/cohort df contains all necessary columns
  # Args:
  #   df: year/cohort df
  #   col.list: named list of columns
  # Return:
  #   df: df, corrected if necessary
  foo <- function(mydf, mystring) {
    if(!(mystring %in% names(mydf))) {
      mydf[mystring] <- 0
    }
    mydf
  }
  df <- sapply(df, foo, mystring = col.list)
  df
}
...but I got the wrong result:
test <- data.frame(age.16.20 = rep("x", 5), lorem = rep("y", 5))
test <- CheckFullCohorts(test, c("age.16.20", "age.20.25"))
Warning messages:
1: In if (!(mystring %in% names(mydf))) { :
the condition has length > 1 and only the first element will be used
2: In `[<-.factor`(`*tmp*`, mystring, value = 0) :
invalid factor level, NA generated
3: In if (!(mystring %in% names(mydf))) { :
the condition has length > 1 and only the first element will be used
4: In `[<-.factor`(`*tmp*`, mystring, value = 0) :
invalid factor level, NA generated
> test
          age.16.20 lorem
          "x"       "y"
          "x"       "y"
          "x"       "y"
          "x"       "y"
          "x"       "y"
age.16.20 NA        NA
age.20.25 NA        NA
You can easily vectorize this:
test <- data.frame(age.16.20 = rep("x", 5), lorem = rep("y", 5))
musthaves <- c("age.16.20", "age.21.24", "age.25.49",
               "age.50.57", "age.58.65", "age.66.70")
test[musthaves[!(musthaves %in% names(test))]] <- 0
#  age.16.20 lorem age.21.24 age.25.49 age.50.57 age.58.65 age.66.70
#1         x     y         0         0         0         0         0
#2         x     y         0         0         0         0         0
#3         x     y         0         0         0         0         0
#4         x     y         0         0         0         0         0
#5         x     y         0         0         0         0         0
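For reference, the sapply attempt fails because sapply(df, ...) calls foo once per column of df, so mydf is a single column rather than the data frame, and mystring is the whole vector of names rather than one name. If you want to keep the function form from the question, one possible sketch is below. It reuses the name CheckFullCohorts and the col.list argument from the question; setdiff() is just another way of writing the %in% filter above, and the length check is only a precaution for the case where nothing is missing.

```r
# Sketch: add every column named in col.list that df lacks, filled with 0.
CheckFullCohorts <- function(df, col.list) {
  # Names requested in col.list that are not yet columns of df
  missing.cols <- setdiff(col.list, names(df))
  if (length(missing.cols) > 0) {
    df[missing.cols] <- 0  # one assignment creates all missing columns
  }
  df
}

test <- data.frame(age.16.20 = rep("x", 5), lorem = rep("y", 5))
test <- CheckFullCohorts(test, c("age.16.20", "age.20.25"))
names(test)
# [1] "age.16.20" "lorem"     "age.20.25"
```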
However, NA values are usually more appropriate than 0.
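A minimal sketch of that variant, with the fill value as a parameter. The function name AddMissingCols and the arguments cols and fill are made up for illustration and are not from the question. NA_real_ is used as the default so the new columns are numeric; a plain NA would create logical columns instead.

```r
# Sketch (hypothetical helper): add missing columns, filled with NA by default.
AddMissingCols <- function(df, cols, fill = NA_real_) {
  # Names requested in cols that are not yet columns of df
  missing.cols <- setdiff(cols, names(df))
  if (length(missing.cols) > 0) {
    df[missing.cols] <- fill
  }
  df
}

test <- data.frame(age.16.20 = rep("x", 5), lorem = rep("y", 5))
musthaves <- c("age.16.20", "age.21.24", "age.25.49",
               "age.50.57", "age.58.65", "age.66.70")
test <- AddMissingCols(test, musthaves)         # missing columns become NA
test0 <- AddMissingCols(test[1:2], musthaves, fill = 0)  # same as the 0 version
```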