R函数计算给定[不一致]约束的最近邻距离？

Question

我的数据包含已知 X 和 Y 坐标处的树木生长测量值（直径和高度）。我想确定到每棵树的最近邻居 大小相等或更大 的距离。

我看到其他 SE 问题询问最近邻计算（例如，参见 here, here, here, here 等），但 none 指定要搜索的最近邻的约束条件。

是否有一个函数（或其他变通方法）可以让我确定一个点最近邻居的距离鉴于最近的点满足一些条件（例如，大小必须等于或大于兴趣点）?

[一组更复杂的约束会更有帮助...]

对于我的例子：指定一棵树必须也与感兴趣的树在同一地块中或者与感兴趣的树是同一物种

Answer 1

我会使用非等值连接和 data.table

编辑：（仅供参考，这需要 data.table 1.9.7，您可以从 github 获得）

EDIT2：使用 data.table 的副本完成它，因为它似乎是在自己的门槛上加入。我会在以后解决这个问题，但目前可以使用。

library(data.table)
dtree <- data.table(id = 1:1000,
                    x = runif(1000), 
                    y = runif(1000), 
                    height = rnorm(1000,mean = 100,sd = 10),
                    species = sample(LETTERS[1:3],1000,replace = TRUE),
                    plot = sample(1:3,1000, replace = TRUE))
dtree_self <- copy(dtree)
dtree_self[,thresh1 := height + 10]
dtree_self[,thresh2 := height - 10]

# Join on a range, must be a cartesian join, since there are many candidates
test <- dtree[dtree_self, on = .(height >= thresh2, 
                            height <= thresh1), 
              allow.cartesian = TRUE]

# Calculate the distance
test[, dist := (x - i.x)**2 + (y - i.y)**2]

# Exclude identical matches and
# Take the minimum distance grouped by id
final <- test[id != i.id, .SD[which.min(dist)],by = id]

根据给定的阈值，最终数据集包含每一对

编辑：

有附加变量：

如果你想加入额外的参数，这允许你这样做，（如果你另外加入像地块或物种这样的东西可能会更快，因为笛卡尔连接会更小）

这是一个连接两个额外的分类变量，物种和地块的例子：

 library(data.table)
dtree <- data.table(id = 1:1000,
                    x = runif(1000), 
                    y = runif(1000), 
                    height = rnorm(1000,mean = 100,sd = 10),
                    species = sample(LETTERS[1:3],1000,replace = TRUE),
                    plot = sample(1:3,1000, replace = TRUE))
dtree_self <- copy(dtree)
dtree_self[,thresh1 := height + 10]
dtree_self[,thresh2 := height - 10]

# Join on a range, must be a cartesian join, since there are many candidates
test <- dtree[dtree_self, on = .(height >= thresh2, 
                            height <= thresh1,
                            species == species,
                            plot == plot),
              nomatch = NA,
              allow.cartesian = TRUE]

# Calculate the distance
test[, dist := (x - i.x)**2 + (y - i.y)**2]

# Exclude identical matches and
# Take the minimum distance grouped by id
final <- test[id != i.id, .SD[which.min(dist)],by = id]
final

> final
      id         x         y    height species plot  height.1 i.id       i.x       i.y  i.height        dist
  1:   3 0.4837348 0.4325731  91.53387       C    2 111.53387  486 0.5549221 0.4395687 101.53387 0.005116568
  2:  13 0.8267298 0.3137061  94.58949       C    2 114.58949  754 0.8408547 0.2305702 104.58949 0.007111079
  3:  29 0.2905729 0.4952757  89.52128       C    2 109.52128  333 0.2536760 0.5707272  99.52128 0.007054301
  4:  37 0.4534841 0.5249862  89.95493       C    2 109.95493   72 0.4807242 0.6056771  99.95493 0.007253044
  5:  63 0.1678515 0.8814829  84.77450       C    2 104.77450  289 0.1151764 0.9728488  94.77450 0.011122404
 ---                                                                                                        
994: 137 0.8696393 0.2226888  66.57792       C    2  86.57792  473 0.4467795 0.6881008  76.57792 0.395418724
995: 348 0.3606249 0.1245749 110.14466       A    2 130.14466  338 0.1394011 0.1200064 120.14466 0.048960849
996: 572 0.6562758 0.1387882 113.61821       A    2 133.61821  348 0.3606249 0.1245749 123.61821 0.087611511
997: 143 0.9170504 0.1171652  71.39953       C    3  91.39953  904 0.6954973 0.3690599  81.39953 0.112536771
998: 172 0.6834473 0.6221259  65.52187       A    2  85.52187  783 0.4400028 0.9526355  75.52187 0.168501816
>

注意：在最后的答案中，有列 height 和 height.1，后者似乎是 data.table 的 equi join 的结果，分别代表上边界和下边界。

内存高效解决方案

@theforestecologist 的一个问题是这需要大量内存才能完成，

（在这种情况下，还有 42 列被笛卡尔连接相乘，这导致了内存问题），

但是，我们可以通过使用 .EACHI（我相信）以更节省内存的方式来做到这一点。因为我们不会将完整的 table 加载到内存中。该解决方案如下：

library(data.table)
dtree <- data.table(id = 1:1000,
                    x = runif(1000), 
                    y = runif(1000), 
                    height = rnorm(1000,mean = 100,sd = 10),
                    species = sample(LETTERS[1:3],1000,replace = TRUE),
                    plot = sample(1:3,1000, replace = TRUE))
dtree_self <- copy(dtree)
dtree_self[,thresh1 := height + 10]
dtree_self[,thresh2 := height - 10]

# In order to navigate the sometimes unusual nature of scoping inside a
# data.table join, I set the second table to have its own uniquely named id
dtree_self[,id2 := id]
dtree_self[,id := NULL]


# for clarity inside the brackets, 
# I define the squared euclid distance
eucdist <- function(x,xx,y,yy) (x - xx)**2 + (y - yy)**2 

# Join on a range, must be a cartesian join, since there are many candidates 
# Return a table of matches, using .EACHI to keep from loading too much into mem
test <- dtree[dtree_self, on = .(height >= thresh2, 
                                 height <= thresh1,
                                 species,
                                 plot),
              .(id2, id[{z = eucdist(x,i.x,y,i.y); mz <- min(z[id2 != id]); mz == z}]),
              by = .EACHI,
              nomatch = NA,
              allow.cartesian = TRUE]

# join the metadata back onto each id
test <- dtree[test, on = .(id = V2), nomatch = NA]
test <- dtree[test, on = .(id = id2), nomatch = NA]

> test
        id          x          y    height species plot i.id        i.x        i.y  i.height i.species i.plot i.height.2 i.height.1 i.species.1 i.plot.1
   1:    1 0.17622235 0.66547312  84.68450       B    2  965 0.17410840 0.63219350  93.60226         B      2   74.68450   94.68450           B        2
   2:    2 0.04523011 0.33813054  89.46288       B    2  457 0.07267547 0.35725229  88.42827         B      2   79.46288   99.46288           B        2
   3:    3 0.24096368 0.32649256 103.85870       C    3  202 0.20782303 0.38422814  94.35898         C      3   93.85870  113.85870           C        3
   4:    4 0.53160655 0.06636979 101.50614       B    1  248 0.47382417 0.01535036 103.74101         B      1   91.50614  111.50614           B        1
   5:    5 0.83426727 0.55380451 101.93408       C    3  861 0.78210747 0.52812487  96.71422         C      3   91.93408  111.93408           C        3

这样我们应该保持较低的总内存使用率。

R函数计算给定[不一致]约束的最近邻距离？

R function to calculate nearest neighbor distance given [inconsistent] constraint?

r

constraints

distance

nearest-neighbor

euclidean-distance

编辑：

有附加变量：

内存高效解决方案