Iterating over columns in a data frame in order to replace values from matching data in a list of data frames

Question

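I'm interested in building a function, using apply/sapply or Map, that iterates over the available columns in dta and replaces the values in each column with the matching values from a data frame held in an unnamed list of data frames. The index of each list item corresponds to the column number in the dta data frame.

Example

Given the objects: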

set.seed(1)
size <- 20

# Data set
dta <- data.frame(
  unitA = sample(LETTERS[1:4], size = size, replace = TRUE),
  unitB = sample(letters[16:20], size = size, replace = TRUE),
  unitC = sample(month.abb[1:4], size = size, replace = TRUE),
  someValue = sample(1:1e6, size = size, replace = TRUE)
)

# Meta data
lstMeta <- list(
  # Unit A definitions
  data.frame(
    V1 = c("A", "B", "D"),
    V2 = c("Letter A", "Letter B", "Letter D")
  ),
  # Unit B definitions
  data.frame(
    V1 = c("t", "q"),
    V2 = c("small t", "small q")
  ),
  # Unit C definitions
  data.frame(
    V1 = c("Mar", "Jan"),
    V2 = c("March", "January")
  )
)

Desired results

When applied to dta, the function should return a data.frame corresponding to the extract below:

   unitA   unitB   unitC someValue
Letter B small t     Apr    912876
Letter B small q   March    293604
       C       s     Apr    459066
Letter D       p   March    332395
Letter A small q   March    650871
Letter D small q     Apr    258017
Letter D       p January    478546
       C small q     Feb    766311
       C small t   March     84247
Letter A small q   March    875322
Letter A       r     Feb    339073
Letter A       r     Apr    839441
       C       r     Feb    346684
Letter B       p January    333775
Letter D small t January    476352
(...)

Existing approach

replaceLbls <- function(dataSet, lstDict) {
  sapply(seq_along(dataSet), function(i) {
    # Take corresponding metadata data frame
    dtaDict <- lstDict[[i]]

    # Replace values in selected column
    # Where matches on V1 push corresponding values from V2
    dataSet[,i][match(dataSet[,i], dtaDict[,1])] <- dtaDict[,2][match(dtaDict[,1], dataSet[,i])]
  })
}

# Testing -----------------------------------------------------------------

replaceLbls(dataSet = dta, lstDict = lstMeta)

Naturally, the approach above does not work, because it attempts to use NAs in the assignment; but it summarises what I want to achieve:

Error in x[...] <- m : NAs are not allowed in subscripted assignments
In addition: Warning message:
In `[<-.factor`(`*tmp*`, match(dataSet[, i], dtaDict[, 1]), value = c(NA,  :
  invalid factor level, NA generated
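As a rough illustration of where the NAs come from: match() returns NA for every value that has no entry in the lookup vector, and those NAs then end up in the subscript on the left-hand side of the assignment. The sketch below uses small toy vectors (x, from and to are hypothetical names, not objects from the question) and assigns only where a match exists.

```r
# Toy vectors (hypothetical, not the question's data)
x    <- c("A", "C", "B")
from <- c("A", "B")
to   <- c("Letter A", "Letter B")

idx <- match(x, from)  # position of each x in `from`; NA where there is no match
hit <- !is.na(idx)     # TRUE only for values that have a label

# Assign only where a match exists, so no NA reaches the subscript
x[hit] <- to[idx[hit]]
x
```

Values without a dictionary entry ("C" here) are left untouched, which is the behaviour described in the desired results.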

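Additional remarks

Source data set

The main characteristics of the data are:

The list is unnamed, so it has to be subset by item number rather than by name

Item numbers correspond to column numbers

There is no perfect match between the metadata data frames available in the list and the unit columns available in the data

The someValue column should also be iterated over, as it may contain labels that should be replaced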

Solutions

I'm not interested in dplyr/data.table/sqldf-based solutions.

I'm not interested in nested for-loops.

Answer 1

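I have a hacky solution that doesn't use for loops or other packages. I needed to convert the factors to characters for it to work, but you may be able to improve on my solution.

The solution works by matching only the values that are found in your lstMeta, by creating an index vector of the positions where a match is found. I also used the <<- operator. If you are better at R than I am, you can probably improve on this.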

set.seed(1)
size <- 20

# Data set
dta <-
  data.frame(
    unitA = sample(LETTERS[1:4], size = size, replace = TRUE),
    unitB = sample(letters[16:20], size = size, replace = TRUE),
    unitC = sample(month.abb[1:4], size = size, replace = TRUE),
    someValue = sample(1:1e6, size = size, replace = TRUE),
    stringsAsFactors = FALSE
  )

# Meta data
lstMeta <- list(
  # Unit A definitions
  data.frame(
    V1 = c("A", "B", "D"),
    V2 = c("Letter A", "Letter B", "Letter D"),
    stringsAsFactors = FALSE
  ),
  # Unit B definitions
  data.frame(
    V1 = c("t", "q"),
    V2 = c("small t", "small q"),
    stringsAsFactors = FALSE
  ),
  # Unit C definitions
  data.frame(
    V1 = c("Mar", "Jan"),
    V2 = c("March", "January"),
    stringsAsFactors = FALSE
  )
)

replaceLbls <- function(dataSet, lstDict) {
  # One dictionary per column, so iterate over the list positions
  sapply(seq_along(lstDict), function(i) {
    # Take corresponding metadata data frame
    dtaDict <- lstDict[[i]]

    # Rows whose value in column i has a match in V1
    myUniques <- which(dataSet[, i] %in% dtaDict[, 1])

    # Where matches on V1 push corresponding values from V2;
    # <<- updates dataSet in the enclosing function rather than a local copy
    dataSet[myUniques, i] <<- dtaDict[, 2][match(dataSet[myUniques, i], dtaDict[, 1])]
  })
  return(dataSet)
}

# Testing -----------------------------------------------------------------

replaceLbls(dataSet = dta, lstDict = lstMeta)

Answer 2

The following approach works for the example data:

replaceLbls <- function(dataSet, lstDict) {
  dataSet[seq_along(lstDict)] <- Map(function(x, lst) {
    x <- as.character(x)
    idx <- match(x, as.character(lst$V1))
    replace(x, !is.na(idx), as.character(lst$V2)[na.omit(idx)])
  }, dataSet[seq_along(lstDict)], lstDict)
  dataSet
}


head(replaceLbls(dta, lstMeta))
#      unitA   unitB unitC someValue
# 1 Letter B small t   Apr    912876
# 2 Letter B small q March    293604
# 3        C       s   Apr    459066
# 4 Letter D       p March    332395
# 5 Letter A small q March    650871
# 6 Letter D small q   Apr    258017

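This assumes that you want to apply the changes to the first X columns of the data, where X is the length of the meta list. You may want to include an additional step to convert back to factor, since this approach converts the adjusted columns to character class.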
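The extra conversion step could look roughly like the sketch below. It is only an illustration on a single toy column: col, dict and was_factor are hypothetical names, and it assumes the dictionary columns are character vectors.

```r
# Toy column and dictionary (hypothetical values)
col  <- factor(c("A", "C", "B", "A"))
dict <- data.frame(V1 = c("A", "B"),
                   V2 = c("Letter A", "Letter B"),
                   stringsAsFactors = FALSE)

was_factor <- is.factor(col)   # remember the original class

# Same replacement logic as in replaceLbls, applied to one column
x   <- as.character(col)
idx <- match(x, dict$V1)
x   <- replace(x, !is.na(idx), dict$V2[na.omit(idx)])

# Restore the factor class if the column started out as one
if (was_factor) x <- factor(x)
x
```

Inside replaceLbls the same check would have to be made per column, before the as.character() call overwrites the original class.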

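Another remark regarding factors: you can improve performance by working only on the levels of any factor variables instead of the whole column. The general procedure is similar, but it requires a few more steps to check classes etc.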
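A minimal sketch of the levels-only idea, under the assumption that the column is a factor and the dictionary columns are character vectors. relabel_levels is a hypothetical helper name, not part of the answer's code.

```r
# Hypothetical helper: relabel only the levels of a factor column
relabel_levels <- function(f, dict) {
  if (!is.factor(f)) return(f)      # non-factors are returned unchanged
  lv  <- levels(f)
  idx <- match(lv, dict$V1)         # look up each level once, not each row
  hit <- !is.na(idx)
  lv[hit] <- dict$V2[idx[hit]]      # swap in labels for matched levels only
  levels(f) <- lv
  f
}

# Toy data (hypothetical values)
col  <- factor(c("A", "C", "B", "A", "C"))
dict <- data.frame(V1 = c("A", "B"),
                   V2 = c("Letter A", "Letter B"),
                   stringsAsFactors = FALSE)

res <- relabel_levels(col, dict)
res
```

The lookup runs once per distinct level rather than once per row, which is where the potential saving on long columns with few distinct values would come from.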

Answer 3

You can also try this:

mapr <- function(t, meta) {
  # Look up a single value t in V1; return the V2 label on a match, else t
  ind <- match(t, meta$V1)
  if (!is.na(ind)) meta$V2[ind] else t
}

Then use sapply:

# The inner call uses the meta argument rather than the global lstMeta
dta <- as.data.frame(cbind(
  sapply(1:3, function(t, df, meta) {
    sapply(df[, t], mapr, meta[[t]])
  }, dta, lstMeta, simplify = TRUE),
  dta[, 4]
))

Answer 4

A couple of mapply calls will do the job:

f1 <- function(df, lst){
  d1 <- setNames(data.frame(mapply(function(x, y) x$V2[match(y, x$V1)], lst, df[1:3]), 
                            df$someValue, stringsAsFactors = FALSE), 
                 names(df))
  as.data.frame(mapply(function(x, y) replace(x, is.na(x), y[is.na(x)]), d1, df))

}
