Change column type based on condition of variables
I have data; here is a small sample of it:
df <- structure(list(`d955` = c("1", "4", NA, NA),
`65c2` = c("6a08", NA, "6a08", "6a09")),
class = c("tbl_df", "tbl", "data.frame"),
row.names = c(NA, -4L), .Names = c("d955", "65c2"))
# A tibble: 4 x 2
# d955 `65c2`
# <chr> <chr>
# 1 1 6a08
# 2 4 <NA>
# 3 <NA> 6a08
# 4 <NA> 6a09
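Both columns are of character type. I want to change the type of every column that contains only the numbers 1 to 5 to integer. I know I could select the columns manually, but because the columns keep changing, that is not a satisfactory option.
So how can I do this automatically? I have been looking at mutate_if from the dplyr package, but I don't know how to start with selecting the right columns.
I have also been looking at str_detect, which might work, but something like str_detect(df, "[1234]") would also match the strings in the 65c2 column that contain a digit between 1 and 4. I have been searching for a solution with str_count as well, since the count for an integer is always 1, but I haven't found a good way to select columns based on a string-count condition...
Desired automated result: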
# A tibble: 4 x 2
# d955 `65c2`
# <int> <chr>
# 1 1 6a08
# 2 4 <NA>
# 3 <NA> 6a08
# 4 <NA> 6a09
An idea from base R:
i1 <- colSums(sapply(df, function(i) i %in% c(NA, 1:5))) == nrow(df)
df[i1] <- lapply(df[i1], as.integer)
which gives:
str(df)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 4 obs. of 2 variables:
$ d955: int 1 4 NA NA
$ 65c2: chr "6a08" NA "6a08" "6a09"
You can also make it into a function:
my_conversion <- function(df){
i1 <- colSums(sapply(df, function(i) i %in% c(NA, 1:5))) == nrow(df)
df[i1] <- lapply(df[i1], as.integer)
return(df)
}
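As a follow-up to the base R function above, here is a minimal sketch of a variant that tests each column with vapply() instead of colSums(sapply(...)). The name convert_1_to_5 and the sample_df object are made up for illustration. The idea is the same; the only intended difference is that it should also cope with a single-row data frame, where sapply() returns a plain vector and colSums() fails.

```r
# Hypothetical variant of my_conversion(); convert_1_to_5 is an illustrative name.
convert_1_to_5 <- function(df) {
  # TRUE for each column whose every entry is NA or one of 1..5
  ok <- vapply(df, function(col) all(col %in% c(NA, 1:5)), logical(1))
  # Convert only the qualifying columns, leave the rest untouched
  df[ok] <- lapply(df[ok], as.integer)
  df
}

# Same sample values as in the question, built as a plain data.frame
sample_df <- data.frame(
  d955 = c("1", "4", NA, NA),
  `65c2` = c("6a08", NA, "6a08", "6a09"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

result <- convert_1_to_5(sample_df)
sapply(result, class)
```

With the sample data, d955 should come back as integer and 65c2 should stay character.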
Using data.table:
library(data.table)
setDT(df)
# get the names of all the character columns
# (i.e. we can skip numeric/other columns);
# := needs column names or positions on its left-hand side
char_cols = names(df)[sapply(df, is.character)]
# := is the assignment operator in data.table --
# since data.table is built for efficiency,
# this differs from base R or dplyr assignment
# since assignment with := is _by reference_,
# meaning no copies are created. there are other
# advantages of :=, like simple assignment
# by group -- see the intro vignettes
# .SD is a reflexive reference -- if .SDcols
# is unspecified, it simply refers to your
# data.table itself -- df[ , .SD] is the same as df.
# .SDcols is used to restrict which columns are
# included in this Subset of the Data -- here,
# we only include character columns.
# Finally, by lapply-ing .SD, we essentially loop
# over the specified columns to apply our
# custom-tailored function
df[ , (char_cols) := lapply(.SD, function(x) {
  # keep the column as-is if any entry contains a character outside 1-5
  if (any(grepl('[^1-5]', x))) x
  else as.integer(x)
}), .SDcols = char_cols]
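I hope the conversion logic is clear; I can elaborate if needed.
See the Getting Started wiki for a primer and plenty of other resources to get acquainted with the basics of data.table.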
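One caveat on the regex test in the data.table answer: the negated class [^1-5] only rejects strings that contain a character outside 1-5, so a multi-digit string such as "12" (or an empty string) would still pass and be converted. If only single digits 1 to 5 should qualify, an anchored pattern is stricter. The helper below is a sketch under that assumption; the name is_digit_1_to_5 is hypothetical and not part of the answer.

```r
# Hypothetical helper; is_digit_1_to_5 is an illustrative name.
is_digit_1_to_5 <- function(x) {
  # Every non-missing entry must be exactly one character in the range 1-5
  all(is.na(x) | grepl("^[1-5]$", x))
}

is_digit_1_to_5(c("1", "4", NA, NA))  # TRUE
is_digit_1_to_5(c("12", "3"))         # FALSE: "12" has two digits
is_digit_1_to_5(c("6a08", NA))        # FALSE

# The looser test finds nothing outside 1-5 in "12", so it would convert that column
any(grepl("[^1-5]", c("12", "3")))    # FALSE
```

Inside the lapply() call, the condition could then read if (is_digit_1_to_5(x)) as.integer(x) else x.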
A solution using mutate_if from the dplyr package. For this task we need to define a predicate function (is_one_five_only).
library(dplyr)
# Design a function to determine if elements from one vector are all 1 to 5
# Notice that if the entire column is NA, it will report FALSE
is_one_five_only <- function(x){
if (all(is.na(x))){
return(FALSE)
} else {
x2 <- x[!is.na(x)]
return(all(x2 %in% 1:5))
}
}
# Apply is_one_five_only as the predicate function in mutate_if
df2 <- df %>% mutate_if(is_one_five_only, as.integer)
df2
# # A tibble: 4 x 2
# d955 `65c2`
# <int> <chr>
# 1 1 6a08
# 2 4 <NA>
# 3 NA 6a08
# 4 NA 6a09
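For readers on dplyr 1.0.0 or later, where mutate_if is superseded, the same predicate can be plugged into across() with where(). The block below is a sketch of that equivalent form, not a change to the answer above; it rebuilds the sample data with tibble() so it runs on its own, and the predicate is written a bit more compactly but should behave the same way (an all-NA column still reports FALSE).

```r
library(dplyr)

# Same idea as the predicate above: TRUE only if the column has at least one
# non-missing value and all non-missing values are in 1..5
is_one_five_only <- function(x) {
  x2 <- x[!is.na(x)]
  length(x2) > 0 && all(x2 %in% 1:5)
}

# Sample data from the question
df <- tibble(
  d955 = c("1", "4", NA, NA),
  `65c2` = c("6a08", NA, "6a08", "6a09")
)

# across(where(...)) picks the columns for which the predicate returns TRUE
df2 <- df %>% mutate(across(where(is_one_five_only), as.integer))
df2
```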