r 中具有邻域策略的空间聚类 3d 数组

Question

我需要使用邻域策略在 3d 数组中执行空间聚类。更清楚地说：我有一个表示为稀疏 3d 数组的 3d 图像。有些实例是 1，而大多数是 0。我想将彼此相邻的等于 1 的实例聚集在一起（即，如果我们将每个实例想象成一个立方体，我想将共享一个实例的实例聚集在一起面、边或角，等于 1)。

我需要在 R 中执行此操作，因为此步骤是较长的机器学习管道的一部分，并且我正在尝试在单个环境中实施整个管道以最大程度地减少麻烦。我找到了一个与当前问题略有相关的已回答问题 here。然而，在那种情况下，集群的数量是事先已知的，而在我的情况下，集群的数量可以是从 1 到实例数等于 1 的任何值（前提是没有实例与另一个实例相邻）。

我可以为此目标编写一个函数，但这会很耗时，而且可能效率不高，因为除了寻找非零实例之外，我想不出任何其他策略，检查每个邻居实例，如果有的话这些是非零的，而不是检查它的邻居等等。

由于聚类步骤包含在嵌套交叉验证循环中，您可以亲眼看到我需要更高效的东西（或者可能只是用 C 编写的相同东西，以便更快）。

你们中有人知道可以帮助我的功能或软件包吗？

更新

为了回答评论，我的 "sparse" 数组在大多数元素为零的意义上是稀疏的，而不是在以稀疏格式保存的意义上。这是一个玩具示例（这确实是我原始数组的非零元素周围的裁剪，具有暗淡的 (91,109,91)）。

sparse_array = structure(c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 
1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 
0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0), .Dim = c(13L, 3L, 6L))

更新 2

我正在 Windows x64 机器上工作，使用 RStudio 1.0.153 和 R 版本 3.4.2（Short Summer）

更新 3

我已经尝试过@gdkrmr 给出的答案，虽然它对于给定的示例工作正常，但它无法推广到更大更复杂的图像。具体来说，它在我的图像中过度分离了集群，这意味着确实相互接触的体素有时会被分成不同的集群。你可以自己想象一下下载这个image和运行下面的代码

读取 3d 图像

library(oro.nifti)
roi <- readNIfTI("image_to_cluster.nii")
roi_img <- cal_img(roi)

以数组形式读取数据

array_img <- roi@.Data

以稀疏格式转换

sparse_format <- (array_img > 0) %>%
  which(., arr.ind = TRUE)

找到相邻体素

neighborhoods <- sparse_format %>%
  dist %>%
  as.matrix %>%
  {. < 2}

分配集群标签

cluster <- 1:nrow(sparse_format)
for (i in 1:nrow(sparse_format)) {
  cl_idx <- cluster[i]
  cluster[neighborhoods[, i]] <- cl_idx
}
sparse_format <- sparse_format %>%
  as_data_frame(.) %>%
  mutate(cluster_id = cluster)

将簇写入新的 3d 图像

new_img <- roi
new_img@.Data <- array(0,c(74,92,78))

for (cl in cluster) {
  new_img@.Data[sparse_format %>% filter(., cluster_id == cl) %>% select(dim1,dim2,dim3) %>% as.matrix] <- cl
}
writeNIfTI(new_img, "test", verbose=TRUE)

现在，如果您打开文件 test.nii.gz（您可以使用 mricron 等），您会看到在大致坐标 37 23 15 处有一个大集群已被拆分在 3 个不同的集群中，即使所有体素都已连接。

Answer 1

您可以使用 spatstat 包来完成此操作。你需要新的从 github 创建了分支 connected.pp3，如果你可以安装它加载了 devtools 或 remotes 包（这里我使用 remotes):

library(remotes)
install_github("spatstat/spatstat")

library(spatstat)

网格和边界框

grid <- expand.grid(0:4,0:4,0:4)
bb <- box3(range(grid[,1]), range(grid[,2]), range(grid[,3]))

稀疏数据数组（以及稀疏行的 id）

grid$id <- 1:nrow(grid)
set.seed(42)
a <- grid[sample(nrow(grid), 20),]
a
#>     Var1 Var2 Var3  id
#> 115    4    2    4 115
#> 117    1    3    4 117
#> 36     0    2    1  36
#> 102    1    0    4 102
#> 78     2    0    3  78
#> 63     2    2    2  63
#> 88     2    2    3  88
#> 16     0    3    0  16
#> 77     1    0    3  77
#> 82     1    1    3  82
#> 53     2    0    2  53
#> 116    0    3    4 116
#> 106    0    1    4 106
#> 29     3    0    1  29
#> 52     1    0    2  52
#> 104    3    0    4 104
#> 107    1    1    4 107
#> 13     2    2    0  13
#> 51     0    0    2  51
#> 60     4    1    2  60

转换为 3D 点模式并找到连接的组件（返回为所谓的分数）。正如@gdkrmr 所指出的那样距离小于 2 是邻居（这里我们使用 1.8，但任何在 sqrt(3) 和 2 之间应该可以工作。

x <- pp3(a[,1], a[,2], a[,3], bb)
x_labelled <- connected.pp3(x, R = 1.8)
df <- data.frame(cluster_id = marks(x_labelled), point_id = a$id)

为了更好的打印，我们根据簇 id 排序

df[order(df$cluster_id, df$point_id),]
#>    cluster_id point_id
#> 1           1      115
#> 14          2       29
#> 19          2       51
#> 15          2       52
#> 11          2       53
#> 20          2       60
#> 6           2       63
#> 9           2       77
#> 5           2       78
#> 10          2       82
#> 7           2       88
#> 4           2      102
#> 16          2      104
#> 13          2      106
#> 17          2      107
#> 12          2      116
#> 2           2      117
#> 8           3       16
#> 3           3       36
#> 18          4       13

Answer 2

这是一个纯R的解决方案，它利用了相邻体素的最大距离是sqrt(d) < 2 if d <= 3:

library(rgl)
library(magrittr)
sparse_format <- (sparse_array > 0) %>%
  which(., arr.ind = TRUE)
neighborhoods <- sparse_format %>%
  dist %>%
  as.matrix %>%
  {. < 2}
n <- nrow(sparse_format)

perm <- 1:n
for (i in 1:n) {
  perm[i:n] <- perm[i:n][
    order(neighborhoods[perm[i], perm][i:n], 
          decreasing = TRUE)
  ]
}
neighborhoods <- neighborhoods[perm, perm]
sparse_format <- sparse_format[perm, ]

cluster <- 1:n
for (i in 1:n) {
  cl_idx <- cluster[i]
  cluster[neighborhoods[, i]] <- cl_idx
}
plot3d(sparse_format, col = cluster)

更新：添加了 neighborhoods 矩阵的排序以查找连接的集群。这变得非常慢（您的示例图像大约 30 秒），但我认为仍有很大的优化空间。如果您想要一个真正快速的解决方案，请查看 Julia language, especially at Images.jl.

更新：使第一个循环变快。

r 中具有邻域策略的空间聚类 3d 数组

spatial clustering 3d array with neighbourhood strategy in r

r

cluster-analysis

spatial

multidimensional-array