用于 R 树实现的 R 库

Question

我有数据框，例如

df <- data.frame(x = 1:1e3, y = rnorm(1e3))

我需要在 N 个（在我的例子中 N = 6、12 和 24）矩形上拆分具有相同点数的点。如何使用 R-tree algorithm?

拆分我的 df

Answer 1

如果所有点的 x 坐标都不同，如您的示例中的情况，请根据 x 坐标对点进行递增排序。请注意，在这种情况下，您可以简化为 2d 点找到矩形覆盖（具有相同数量的点）的问题，以找到具有 1d 点段的覆盖（即您可以忽略矩形的高度）。

这里是如何找到每个矩形中的点的方法：

num_rect <- 7 # In your example 6, 12 or 24
num_points <- 10 # In your example 1e3

# Already ordered according to x
df <- data.frame(x = 1:num_points, y = rnorm(num_points))

# Minimum number of points in the rectangles to cover all of them
points_in_rect <- ceiling(num_points/num_rect)

# Cover the first points using non-overlaping rectangles
breaks <- seq(0,num_points, by=points_in_rect)
cover <- split(seq(num_points), cut(seq(num_points), breaks))
names(cover) <- paste0("rect", seq(length(cover)))

# Cover the last points using overlaping rectangles
cur_num <- length(cover)
if (num_points < num_rect*points_in_rect ) {
  # To avoid duplicate rectangles
  last <- num_points
  if (num_points %% 1 == 0)
    last <- last -1

  while (cur_num < num_rect) {
    cur_num <- cur_num + 1
    new_rect <- list(seq(last-points_in_rect+1, last))
    names(new_rect) <- paste0("rect", cur_num)
    cover <- c(cover,new_rect)
    last <- last - points_in_rect
  }
}

矩形中的点是：

$rect1
[1] 1 2

$rect2
[1] 3 4

$rect3
[1] 5 6

$rect4
[1] 7 8

$rect5
[1]  9 10

$rect6
[1] 8 9

$rect7
[1] 6 7

包围这些点集的最小边界矩形（平行于轴）就是您要找到的那些。

两个轴上的重复坐标值

随机旋转点（保存旋转角度）并检查是否没有重复的x（或y）坐标。如果是这种情况，将上面的策略与旋转后的坐标一起使用（记得在旋转后的点之前根据新的x坐标进行排序），然后将得到的矩形按相反的方向旋转回来。如果重复的坐标保留在两个轴中，请以不同的（随机）角度再次旋转这些点。由于您的点数有限，因此您总能找到一个将 de x（或 y）坐标分开的旋转角度。

Answer 2

对于 x 轴上均匀分布的数据，kmeans 聚类效果很好（毫不奇怪）：

library(dplyr)
library(ggplot2)

set.seed(1)
df <- data.frame(x = 1:1e3, y = rnorm(1e3))

N <- 10
df$cluster <- kmeans(df,N)$cluster

cluster_rectangles <- df %>% group_by(cluster) %>% 
       summarize(xmin = min(x),
                 xmax = max(x),
                 ymin = min(y),
                 ymax = max(y),
                 n = n())  

ggplot() + geom_rect(data = cluster_rectangles, mapping=aes(xmin=xmin, xmax=xmax, ymin=ymin, ymax=ymax, fill=cluster)) +
           geom_point(data = df,mapping=aes(x,y),color='white')

如果 x 分布正常，它也有效：

df <- data.frame(x = rnorm(1e3), y = rnorm(1e3))

缺点是每个矩形的点数不同：

> cluster_rectangles %>% select(cluster,n)
# A tibble: 10 x 2
   cluster     n
     <int> <int>
 1       1   137
 2       2    58
 3       3   121
 4       4    61
 5       5    72
 6       6   184
 7       7    78
 8       8    70
 9       9   126
10      10    93

对于均匀分布，结果很好（N=9）：

用于 R 树实现的 R 库

R library for R-tree implementation

tree

r

cluster-analysis

两个轴上的重复坐标值