R: replacing columns by lookup in a dictionary
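In this question I need to look up values for columns of a data frame based not on a single attribute alone, but on several attributes, including a range that is compared against a dictionary.
(Yes, this is actually the sequel to an earlier story.)
For people who know R this should be an easy problem, because I provide a working solution with basic indexing that only needs an upgrade, which is probably simple... but it is hard for me, since I am still learning R.
Where I am starting from:
When I want to replace the missing values in the columns testcolnames of the (large) table df1 with the default from the (small) dictionary testdefs (the row is selected by making testdefs$LABMET_ID equal to the column name from testcolnames), I use this code: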
testcolnames <- c("80", "116")  # ...result of regexp on colnames(df1), originally much longer
df1[, testcolnames] <- lapply(testcolnames, function(x) {
  tmpcol <- df1[, x]
  tmpcol[is.na(tmpcol)] <- testdefs$default[match(x, testdefs$LABMET_ID)]
  tmpcol
})
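Where I want to go:
Now I need to upgrade this solution. The table testdefs will have (see the example below) multiple rows with the same LABMET_ID that differ only in two new columns called lower and upper. These act as bounds on the variable df1$rngvalue and select which value is used for the replacement.
In other words, the solution should not only select the rows of testdefs where testdefs$LABMET_ID equals the column name, but from those rows select the one where df1$rngvalue lies in the range between testdefs$lower and testdefs$upper (if no such row exists, take the closest range, that is the lowest or the highest; if the dictionary does not contain the LABMET_ID at all, we can leave the NA in the original data).
An example: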
testdefs
"LABMET_ID","lower","upper","default","notuse","notuse2"
30,0,54750,25,80,2 #..."many columns we dont care about"
46,0,54750,1.45,3.5,0.2
80,0,54750,0.03,0.1,0.01
116,0,30,0.09,0.5,0.01
116,31,365,0.135,0.7,0.01
116,366,5475,0.11,0.7,0.01
116,5476,54750,0.105,0.7,0.02
df1:
"rngvalue","80","116"
36,NA,NA
600000,NA,NA
367,5,NA
90,NA,6
is transformed into:
"rngvalue","80","116"
36,0.03,0.135 #col80 is always replaced by 0.03
600000,0.03,0.105 #col116 needs to be decided on range, this value is bigger than everything in dictionary so take the last one
367,5,0.11 #5 not replaced, but second column nicely looks up to 0.11
90,0.03,6 #6 not replaced
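To try the approaches discussed here, the example tables can be rebuilt in R. This is only a sketch of the data shown above: the unused columns notuse and notuse2 are left out, and check.names = FALSE is assumed so that the numeric column names "80" and "116" are kept as-is.

```r
# Dictionary: one or more rows per LABMET_ID, each row covering lower..upper
testdefs <- data.frame(
  LABMET_ID = c(30, 46, 80, 116, 116, 116, 116),
  lower     = c(0, 0, 0, 0, 31, 366, 5476),
  upper     = c(54750, 54750, 54750, 30, 365, 5475, 54750),
  default   = c(25, 1.45, 0.03, 0.09, 0.135, 0.11, 0.105)
)

# Data: the column names are LABMET_IDs; check.names = FALSE stops
# data.frame() from renaming "80" and "116" to syntactic names
df1 <- data.frame(
  rngvalue = c(36, 600000, 367, 90),
  "80"     = c(NA, NA, 5, NA),
  "116"    = c(NA, NA, NA, 6),
  check.names = FALSE
)
```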
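Since the intervals have no gaps, you can use findInterval. I would use dlply from plyr to turn the lookup table into a list that holds, for each LABMET_ID, the break points and the default values.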
## Transform lookup table to a list with breaks for intervals
library(plyr)
lookup <- dlply(testdefs, .(LABMET_ID), function(x)
  # breaks: every lower bound, closed off by the last upper bound
  # (assumes the rows of each LABMET_ID are sorted by lower)
  list(breaks = c(x$lower, x$upper[nrow(x)]),
       default = x$default))
So the lookup now looks like
lookup[["116"]]
# $breaks
# [1] 0 31 366 5476 54750
#
# $default
# [1] 0.090 0.135 0.110 0.105
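If plyr is not available, a list of the same shape can probably be built with base R alone. The following is a sketch under that assumption, not part of the original answer: the name lookup_base is made up for illustration, and the rows are sorted by lower first so that the breaks come out ascending.

```r
# Reduced copy of the dictionary so the block runs on its own
testdefs <- data.frame(
  LABMET_ID = c(80, 116, 116, 116, 116),
  lower     = c(0, 0, 31, 366, 5476),
  upper     = c(54750, 30, 365, 5475, 54750),
  default   = c(0.03, 0.09, 0.135, 0.11, 0.105)
)

# split() gives one data frame per LABMET_ID, named by the ID as a string
lookup_base <- lapply(split(testdefs, testdefs$LABMET_ID), function(x) {
  x <- x[order(x$lower), ]                       # intervals in ascending order
  list(breaks  = c(x$lower, x$upper[nrow(x)]),   # lower bounds + final upper bound
       default = x$default)
})

lookup_base[["116"]]$breaks
```

Because the list is indexed by character name, lookup_base[["116"]] works the same way as lookup[["116"]] above.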
Then you can do the lookup as follows
testcolnames <- c("80", "116")
df1[, testcolnames] <- lapply(testcolnames, function(x) {
  tmpcol <- df1[, x]
  # one default per row of df1, chosen by the interval rngvalue falls into
  defaults <- with(lookup[[x]], {
    default[pmax(pmin(length(breaks) - 1, findInterval(df1$rngvalue, breaks)), 1)]
  })
  tmpcol[is.na(tmpcol)] <- defaults[is.na(tmpcol)]
  tmpcol
})
# rngvalue 80 116
# 1 36 0.03 0.135
# 2 600000 0.03 0.105
# 3 367 5.00 0.110
# 4 90 0.03 6.000
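The question also asks that a column whose LABMET_ID is missing from the dictionary simply keeps its NAs. One way to cover that case is to skip the column when the lookup has no entry for it. The helper below is a hypothetical sketch (the name fill_from_lookup, its arguments, and the extra column "999" are invented for illustration), with the lookup list written out by hand so the block runs on its own.

```r
lookup <- list(
  "80"  = list(breaks = c(0, 54750), default = 0.03),
  "116" = list(breaks = c(0, 31, 366, 5476, 54750),
               default = c(0.09, 0.135, 0.11, 0.105))
)

df1 <- data.frame(
  rngvalue = c(36, 600000, 367, 90),
  "80"     = c(NA, NA, 5, NA),
  "116"    = c(NA, NA, NA, 6),
  "999"    = c(NA, 1, NA, NA),   # hypothetical ID with no dictionary entry
  check.names = FALSE
)

# Hypothetical helper: fill NAs per column, leaving unknown IDs untouched
fill_from_lookup <- function(df, cols, lookup, rng = df$rngvalue) {
  df[cols] <- lapply(cols, function(x) {
    tmpcol <- df[[x]]
    entry <- lookup[[x]]
    if (is.null(entry)) return(tmpcol)          # ID not in dictionary: keep NAs
    idx <- findInterval(rng, entry$breaks)
    idx <- pmax(pmin(idx, length(entry$breaks) - 1), 1)  # clamp to valid intervals
    missing <- is.na(tmpcol)
    tmpcol[missing] <- entry$default[idx[missing]]
    tmpcol
  })
  df
}

filled <- fill_from_lookup(df1, c("80", "116", "999"), lookup)
```

Columns "80" and "116" are filled as in the output above, while "999" comes back unchanged.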
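If a rngvalue lies outside the range, findInterval returns an index that is too low or too high: 0 for values below the first break, and the number of breaks for values at or above the last break. That is the reason for the pmin and pmax in the code above.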
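The clamping can be seen in isolation on the breaks for LABMET_ID 116. This is a small illustration with made-up rngvalue inputs (rng), chosen so that one value falls below the first break and two fall at or beyond the last one.

```r
breaks   <- c(0, 31, 366, 5476, 54750)
defaults <- c(0.09, 0.135, 0.11, 0.105)
rng      <- c(-5, 36, 367, 54750, 600000)   # illustrative inputs

# Raw interval index: number of breaks that are <= each value
idx <- findInterval(rng, breaks)
idx
# [1] 0 2 3 5 5

# Clamp into 1..(length(breaks) - 1) so every index points at a default
clamped <- pmax(pmin(idx, length(breaks) - 1), 1)
clamped
# [1] 1 2 3 4 4

defaults[clamped]
# [1] 0.090 0.135 0.110 0.105 0.105
```

Without the clamp, index 0 would silently drop an element and index 5 would yield NA, so out-of-range rows would not get the nearest default.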