Transform data based on metadata, avoiding for-loops, using data.table joins
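Question: I have the following metadata data.table object. Based on it, I want to convert the extension and start_date columns of the actual data.table dt into Date columns. I have a solution in which I iterate over the rows of meta_dt. Since I would like to avoid the for loop, can you think of a clever data.table join?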
library(data.table)

meta_dt <- data.table(
  col_n = c("id", "description", "extension", "start_date"),
  type  = c("character", "character", "date", "date"),
  form  = c(NA, NA, "%Y-%m-%d", "%Y-%m-%d")
)

dt <- data.table(
  id          = c(1, 2, 3, 4),
  description = c("ab", "ac", "ad", "ae"),
  extension   = c("2020-01-01", "2020-12-31", "2020-05-01", "2020-01-04"),
  start_date  = c("2020-09-01", "2020-11-31", "2020-08-19", "2020-03-14")
)
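Expected result: the structure of the result should look as follows (that is, only the columns marked as dates in the metadata are converted; the other columns are left untouched):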
Classes ‘data.table’ and 'data.frame': 4 obs. of 4 variables:
$ id : num 1 2 3 4
$ description: chr "ab" "ac" "ad" "ae"
$ extension : Date, format: "2020-01-01" "2020-12-31" ...
$ start_date : Date, format: "2020-09-01" "2020-11-30" ...
Answer: here is one option using set(), which updates dt by reference:
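for (i in seq_along(dt)) {
  # Look up the target type for this column in the metadata
  correct_type <- meta_dt[col_n == names(dt)[i], type]
  if (!inherits(dt[[i]], correct_type)) {
    if (correct_type %in% c("date", "Date")) {
      # Parse the strings with the format stored in the metadata
      fmt <- meta_dt[col_n == names(dt)[i], form]
      set(dt, j = i, value = as.Date(dt[[i]], format = fmt))
    } else {
      # Coerce other types with as(); the numeric id becomes character here
      set(dt, j = i, value = as(dt[[i]], correct_type))
    }
  }
}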
> str(dt)
Classes ‘data.table’ and 'data.frame': 4 obs. of 4 variables:
$ id : chr "1" "2" "3" "4"
$ description: chr "ab" "ac" "ad" "ae"
$ extension : Date, format: "2020-01-01" "2020-12-31" "2020-05-01" "2020-01-04"
$ start_date : Date, format: "2020-09-01" NA "2020-08-19" "2020-03-14"
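Notes
- The correct class name for date objects starts with an uppercase letter: Date.
- 2020-11-31 is not a valid date in the Gregorian calendar (November has 30 days), so it is converted to NA.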
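For comparison, the explicit loop can also be avoided without a join. The block below is a minimal sketch, not the answerer's method: it assumes the same meta_dt and dt as in the question, filters the metadata down to the date columns, and passes each column together with its format string to as.Date via Map() inside a single := assignment. The names date_meta and date_cols are introduced here only for illustration.

```r
library(data.table)

meta_dt <- data.table(
  col_n = c("id", "description", "extension", "start_date"),
  type  = c("character", "character", "date", "date"),
  form  = c(NA, NA, "%Y-%m-%d", "%Y-%m-%d")
)

dt <- data.table(
  id          = c(1, 2, 3, 4),
  description = c("ab", "ac", "ad", "ae"),
  extension   = c("2020-01-01", "2020-12-31", "2020-05-01", "2020-01-04"),
  start_date  = c("2020-09-01", "2020-11-31", "2020-08-19", "2020-03-14")
)

# Keep only the metadata rows that describe date columns (hypothetical helper objects)
date_meta <- meta_dt[type %in% c("date", "Date")]
date_cols <- date_meta$col_n

# Map() pairs each selected column in .SD with its own format string;
# := then overwrites those columns by reference
dt[, (date_cols) := Map(as.Date, .SD, format = date_meta$form),
   .SDcols = date_cols]

str(dt)
```

Unlike the set() loop, this sketch touches only the date columns, so id stays numeric, as in the expected result. The invalid string "2020-11-31" still becomes NA.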