Find matching value in a column based on other vector

Question

I have a data frame and a vector like this:

df1 <- data.frame(orig = c(1,1,1,2,2,2,2,3,3),
                  proxy = c(1,43,65,2,44,45,46,3,55),
                  dist = c(0, 100,101, 10, 1000, 5000, 5001,0,3))

v <- c(1,45:100)

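What I want now is:

For each unique value in df1$orig (for simplicity it is a number here, but it could also be a character), if the same orig value is not available in v, find the best proxy that is available in v, i.e. the one with the lowest dist.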

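In this example, the first value in df1$orig is 1, and this value is also available in v, so we take it. The second unique value in df1$orig is 2, which is not available in v. In this case the best proxy, the one with the lowest dist, is 44, but it is not in v either. The next best is 45, which is in v, so we take it. The third unique value in df1$orig is 3, and there is no 3 in v. The best proxy here is 55.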

The solution is c(1,45,55)

Note that the first value in proxy for each orig is the orig value itself. dist is sorted here, but that is not necessarily always the case.
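As a quick check of the walk-through for orig 2, the sketch below lists its candidates in order of dist and flags which ones are in v. The names cand and in_v are helpers introduced here for illustration only; they are not part of the original data.

```r
df1 <- data.frame(orig  = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
                  proxy = c(1, 43, 65, 2, 44, 45, 46, 3, 55),
                  dist  = c(0, 100, 101, 10, 1000, 5000, 5001, 0, 3))
v <- c(1, 45:100)

# Candidates for orig 2, closest first
cand <- df1[df1$orig == 2, ]
cand <- cand[order(cand$dist), ]

# in_v (hypothetical helper column): is this proxy usable, i.e. present in v?
cand$in_v <- cand$proxy %in% v

# The first usable candidate is the wanted proxy for orig 2
cand$proxy[cand$in_v][1]
#[1] 45
```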

Answer 1

This can be done with {dplyr} in a few steps: keep the proxies that are in v, sort by dist, and pick the first row for each orig:

library(dplyr)

df1 %>% 
  filter(proxy %in% v) %>% 
  arrange(dist) %>% 
  group_by(orig) %>% 
  slice(1)
#> # A tibble: 3 x 3
#> # Groups:   orig [3]
#>    orig proxy  dist
#>   <dbl> <dbl> <dbl>
#> 1     1     1     0
#> 2     2    45  5000
#> 3     3    55     3

Created on 2019-09-11 by the reprex package (v0.3.0)
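The pipeline above returns a grouped tibble, whereas the question asks for a plain vector. A minimal sketch of one way to get there, assuming the same df1 and v as in the question, is to drop the grouping and pull out the proxy column (the name res is introduced here for illustration):

```r
library(dplyr)

df1 <- data.frame(orig  = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
                  proxy = c(1, 43, 65, 2, 44, 45, 46, 3, 55),
                  dist  = c(0, 100, 101, 10, 1000, 5000, 5001, 0, 3))
v <- c(1, 45:100)

res <- df1 %>%
  filter(proxy %in% v) %>%   # keep only proxies that are available in v
  arrange(dist) %>%          # closest proxies first
  group_by(orig) %>%
  slice(1) %>%               # first (closest) row per orig
  ungroup() %>%
  pull(proxy)                # extract the proxy column as a vector

res
#> [1]  1 45 55
```

Note that group_by() orders the result by orig, and that an orig with no proxy in v is dropped by the filter() step rather than reported as NA.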

Answer 2

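In case you are also interested in a base solution alongside the dplyr one.

First reduce df1 to the rows where proxy matches a value in v, then order them by orig and dist, then keep the rows whose orig is not duplicated.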

tt <- df1[df1$proxy %in% v,]
tt <- tt[order(tt$orig, tt$dist),]
tt[!duplicated(tt$orig),]
#  orig proxy dist
#1    1     1    0
#6    2    45 5000
#9    3    55    3

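Or, in case some orig values get lost because none of their proxy values match anything in v, you can use the following, which returns one element per unique orig (NA where no proxy was found):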

tt <- df1[df1$proxy %in% v,]
tt <- tt[order(tt$orig, tt$dist),]
tt <- tt[!duplicated(tt$orig),c("orig", "proxy")]
tt$proxy[match(unique(df1$orig), tt$orig)]
#[1]  1 45 55
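With the data from the question every orig has a usable proxy, so the match() step makes no visible difference. The sketch below uses a hypothetical data frame df2, which is the question's df1 plus an invented orig 4 whose proxies are not in v, to show where the NA appears.

```r
# Hypothetical data: df1 from the question plus an orig (4) with no proxy in v
df2 <- data.frame(orig  = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4),
                  proxy = c(1, 43, 65, 2, 44, 45, 46, 3, 55, 4, 20),
                  dist  = c(0, 100, 101, 10, 1000, 5000, 5001, 0, 3, 0, 7))
v <- c(1, 45:100)

# Same steps as above: keep usable proxies, sort, keep the first row per orig
tt <- df2[df2$proxy %in% v, ]
tt <- tt[order(tt$orig, tt$dist), ]
tt <- tt[!duplicated(tt$orig), c("orig", "proxy")]

# orig 4 is no longer in tt, so match() yields NA for it
res <- tt$proxy[match(unique(df2$orig), tt$orig)]
res
#[1]  1 45 55 NA
```

The result stays aligned with unique(df2$orig), so the NA can be traced back to the orig that has no proxy in v.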
