R replacing columns by lookup to dictionary

Question

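In this question I need to look up values for data frame columns based not on a single attribute, but on several attributes and on ranges compared against a dictionary. (Yes, this is actually the sequel to an earlier story.)

For people who know R this should be an easy problem, because I am providing a working solution for the basic indexing case that only needs to be upgraded, possibly easily... but it is hard for me, because I am still learning R.

Where to start from:

When I want to replace missing values in the columns testcolnames of the (big) table df1 with the default from the (small) dictionary testdefs (the row is selected by making testdefs$LABMET_ID equal to the column name from testcolnames), I use this code: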

testcolnames=c("80","116") #...result of regexp on colnames(df1), originally much longer

df1[,testcolnames] <- lapply(testcolnames, function(x) { tmpcol<-df1[,x];
  tmpcol[is.na(tmpcol)] <- testdefs$default[match(x, testdefs$LABMET_ID)];
  tmpcol  })

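Where to go:

Now I need to upgrade this solution. The table testdefs will have (example below) multiple rows with the same LABMET_ID, differing only in two new columns called lower and upper. These serve as bounds for the variable df1$rngvalue and select which value is used for the replacement.

In other words: the solution should not only select the rows from testdefs where testdefs$LABMET_ID equals the column name, but also pick, from those rows, the one where df1$rngvalue lies between testdefs$lower and testdefs$upper. If no such row exists, take the closest range (the lowest or the highest). If the dictionary has no matching LABMET_ID, we can leave NA in the original data.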
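To illustrate why the original code is not enough once LABMET_ID repeats: match() only returns the position of the first match, so every missing value would get the default of the first range. A minimal sketch, using short vectors copied from the example dictionary (the variable names ids, defaults and first_hit are only for illustration):

```r
# LABMET_ID and default columns, reduced to the rows for 80 and 116
ids      <- c(80, 116, 116, 116, 116)
defaults <- c(0.03, 0.09, 0.135, 0.11, 0.105)

# match() stops at the first hit, regardless of rngvalue
first_hit <- match("116", ids)
defaults[first_hit]      # always the default of the first range

# all candidate rows that a range-aware lookup has to choose from
which(ids == 116)
```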

An example:

testdefs

"LABMET_ID","lower","upper","default","notuse","notuse2"
30,0,54750,25,80,2            #..."many columns we don't care about"
46,0,54750,1.45,3.5,0.2
80,0,54750,0.03,0.1,0.01
116,0,30,0.09,0.5,0.01
116,31,365,0.135,0.7,0.01
116,366,5475,0.11,0.7,0.01
116,5476,54750,0.105,0.7,0.02

df1:

"rngvalue","80","116"
36,NA,NA
600000,NA,NA
367,5,NA
90,NA,6

Transforms into:

"rngvalue","80","116"
36,0.03,0.135                   #col80 is always replaced by 0.03
600000,0.03,0.105               #col116 needs to be decided on range, this value is bigger than everything in dictionary so take the last one
367,5,0.11                      #5 not replaced, but second column nicely looks up to 0.11
90,0.03,6                       #6 not replaced
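For anyone who wants to reproduce this, the two tables can be built roughly as follows. This is only a sketch: the unused columns notuse and notuse2 are left out, and check.names = FALSE is needed because "80" and "116" are not syntactic R names and would otherwise be renamed.

```r
# Dictionary with range-dependent defaults (columns we don't care about omitted)
testdefs <- data.frame(
  LABMET_ID = c(30, 46, 80, 116, 116, 116, 116),
  lower     = c(0, 0, 0, 0, 31, 366, 5476),
  upper     = c(54750, 54750, 54750, 30, 365, 5475, 54750),
  default   = c(25, 1.45, 0.03, 0.09, 0.135, 0.11, 0.105)
)

# Data with missing values; keep the numeric column names as they are
df1 <- data.frame(
  rngvalue = c(36, 600000, 367, 90),
  "80"     = c(NA, NA, 5, NA),
  "116"    = c(NA, NA, NA, 6),
  check.names = FALSE
)

names(df1)
```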

Answer 1

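Since the intervals have no gaps, you can use findInterval. I would use dlply from plyr to change the lookup table into a list that holds the breaks and the defaults for each LABMET_ID.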

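## Transform lookup table to a list with breaks for intervals
library(plyr)
lookup <- dlply(testdefs, .(LABMET_ID), function(x)
    ## interleave lower/upper, append the last upper, then keep every other
    ## element: this yields all lower bounds plus the final upper bound
    list(breaks=c(rbind(x$lower, x$upper), x$upper[length(x$upper)])[c(TRUE,FALSE)],
         default=x$default))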
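If you would rather avoid the plyr dependency, a similar list can probably be built in base R with split and lapply. This is a sketch, not part of the original answer; lookup_base is a hypothetical name, and it assumes the rows for each LABMET_ID are already sorted by lower.

```r
testdefs <- data.frame(
  LABMET_ID = c(30, 46, 80, 116, 116, 116, 116),
  lower     = c(0, 0, 0, 0, 31, 366, 5476),
  upper     = c(54750, 54750, 54750, 30, 365, 5475, 54750),
  default   = c(25, 1.45, 0.03, 0.09, 0.135, 0.11, 0.105)
)

# One list element per LABMET_ID: all lower bounds plus the last upper bound
lookup_base <- lapply(split(testdefs, testdefs$LABMET_ID), function(x) {
  list(breaks  = c(x$lower, x$upper[nrow(x)]),
       default = x$default)
})

lookup_base[["116"]]
```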

So, the lookup now looks like

lookup[["116"]]
# $breaks
# [1]     0    31   366  5476 54750
# 
# $default
# [1] 0.090 0.135 0.110 0.105

Then, you can do the lookup with

testcolnames=c("80","116")

df1[,testcolnames] <- lapply(testcolnames, function(x) {
    tmpcol <- df1[,x]
    defaults <- with(lookup[[x]], {
        default[pmax(pmin(length(breaks)-1, findInterval(df1$rngvalue, breaks)), 1)]
    })
    tmpcol[is.na(tmpcol)] <- defaults[is.na(tmpcol)]
    tmpcol
})

#   rngvalue   80   116
# 1       36 0.03 0.135
# 2   600000 0.03 0.105
# 3      367 5.00 0.110
# 4       90 0.03 6.000
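The question also asks to keep NA when the dictionary has no matching LABMET_ID, which the code above does not cover, since it expects every column to have an entry in lookup. One possible way to handle that in base R is sketched below. The helper fill_from_ranges and the extra column "999" are hypothetical and only there to show the missing-ID case.

```r
# Hypothetical helper: fill NAs in one column from range-dependent defaults
fill_from_ranges <- function(values, rng, defs) {
  if (nrow(defs) == 0) return(values)        # ID not in dictionary: keep the NAs
  defs   <- defs[order(defs$lower), ]        # findInterval needs sorted breaks
  breaks <- c(defs$lower, defs$upper[nrow(defs)])
  idx    <- findInterval(rng, breaks)
  idx    <- pmax(pmin(idx, nrow(defs)), 1)   # out of range: use the nearest range
  missing <- is.na(values)
  values[missing] <- defs$default[idx[missing]]
  values
}

testdefs <- data.frame(
  LABMET_ID = c(30, 46, 80, 116, 116, 116, 116),
  lower     = c(0, 0, 0, 0, 31, 366, 5476),
  upper     = c(54750, 54750, 54750, 30, 365, 5475, 54750),
  default   = c(25, 1.45, 0.03, 0.09, 0.135, 0.11, 0.105)
)
df1 <- data.frame(
  rngvalue = c(36, 600000, 367, 90),
  "80"     = c(NA, NA, 5, NA),
  "116"    = c(NA, NA, NA, 6),
  "999"    = rep(NA_real_, 4),               # no dictionary entry for this one
  check.names = FALSE
)

testcolnames <- c("80", "116", "999")
df1[testcolnames] <- lapply(testcolnames, function(x) {
  fill_from_ranges(df1[[x]], df1$rngvalue, testdefs[testdefs$LABMET_ID == x, ])
})
df1
```

Columns "80" and "116" should come out as in the expected result, while "999" stays all NA.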

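If rngvalue falls outside the range, findInterval returns an index below the first interval (0) or above the last one (length(breaks)). That is the reason for pmin and pmax in the code above: they clamp the index to the first or the last default.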
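A small sketch of that clamping, using the breaks and defaults for LABMET_ID 116 from the example (the test values in rng, including the negative one, are made up for illustration):

```r
breaks   <- c(0, 31, 366, 5476, 54750)
defaults <- c(0.09, 0.135, 0.11, 0.105)

# One value below the range, two inside, one on the last break, one above
rng <- c(-5, 36, 367, 54750, 600000)

raw <- findInterval(rng, breaks)
raw                                  # 0 2 3 5 5

# Clamp to valid positions 1 .. length(breaks) - 1
idx <- pmax(pmin(length(breaks) - 1, raw), 1)
idx                                  # 1 2 3 4 4

defaults[idx]                        # 0.090 0.135 0.110 0.105 0.105
```

Without the clamp, an index of 0 would silently drop that element and an index of 5 would give NA, so the defaults vector would no longer line up with the rows of df1.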
