为什么要将点存储在二叉树中?

Why store the points in a binary tree?

本题涉及软件算法,来自On topic

我正在处理来自 Amazon Software Question 的面试问题, 具体来说
"Given a set of points (x,y) and an integer "n", return n 个接近原点的点数

这是此问题的示例高级伪代码答案,来自 Sample Answer
第 1 步:设计一个名为 class 的点,它具有三个字段 - int x、int y、int distance
第 2 步:对于给定的所有点,找到它们与原点之间的距离
第 3 步:将值存储在二叉树中
第四步:堆排序
第 5 步:打印二叉树的前 n 个值

我同意第 1 步和第 2 步,因为从面向对象设计的角度来看,使用一个数据软件包 Point 封装 x、y 和距离字段是有意义的。Ensapsulation

有人可以解释从 3 到 5 的设计决策吗?

这是我将如何完成第 3 到 5 步
第 3 步:将所有点存储在一个数组中
第 4 步:根据距离对数组进行排序(我在这里使用一些内置排序,例如 Arrays.Sort
第 5 步:将数组按升序排序后,我打印出前 n 个值

为什么该回复的作者使用更复杂的数据结构,二叉树而不是像我使用的数组这样更简单的东西?我知道什么是二叉树 - 具有两个指针的节点的分层数据结构。在他的算法中,你必须使用BST吗?

没有必要使用BST。但是,在需要自排序结构时使用 BST 是一种很好的做法。我不认为有必要同时使用 BST 和堆排序(以某种方式)。您可以只使用 BST 并检索前 n 个点。您还可以使用一个数组,对其进行排序并使用前 n 个点。 如果你想对 Point 类型的数组进行排序,你可以实现 Comparable 接口(Point 将实现该接口)并重载默认方法。 您永远不必选择任何数据结构,但通过确定您的需求,您也可以轻松确定最佳结构。

此 post 中描述的方法比解决此类问题所需的方法更复杂。如您所述,按距离简单排序就足够了。但是,为了帮助解释您对示例答案作者试图得到的内容的困惑,可以考虑 k 最近邻 问题,该问题可以用 k-d tree 结构解决将 space 分区应用于 k-d 数据集。对于二维space,那确实是一棵二叉树。这棵树是固有排序的,不需要任何 "heap sorting."

需要注意的是,构建 k-d树的时间复杂度为O(n log n),只有在需要对结构体。如果您只需要执行一次搜索以找到距离原点最近的 k 个邻居,则可以使用朴素的 O(n) 搜索来完成。

如何构建 k-d 树,直接来自 Wiki:

One adds a new point to a k-d tree in the same way as one adds an element to any other search tree. First, traverse the tree, starting from the root and moving to either the left or the right child depending on whether the point to be inserted is on the "left" or "right" side of the splitting plane. Once you get to the node under which the child should be located, add the new point as either the left or right child of the leaf node, again depending on which side of the node's splitting plane contains the new node.

Adding points in this manner can cause the tree to become unbalanced, leading to decreased tree performance. The rate of tree performance degradation is dependent upon the spatial distribution of tree points being added, and the number of points added in relation to the tree size. If a tree becomes too unbalanced, it may need to be re-balanced to restore the performance of queries that rely on the tree balancing, such as nearest neighbour searching.

一旦构建了这棵树,您就可以在 O(k log n) 时间内找到某个点(您的原点)的 k 个最近邻居。

直接来自 Wiki:

Searching for a nearest neighbour in a k-d tree proceeds as follows:

  1. Starting with the root node, the algorithm moves down the tree recursively, in the same way that it would if the search point were being inserted (i.e. it goes left or right depending on whether the point is lesser than or greater than the current node in the split dimension).
  2. Once the algorithm reaches a leaf node, it saves that node point as the "current best"
  3. The algorithm unwinds the recursion of the tree, performing the following steps at each node:
    1. If the current node is closer than the current best, then it becomes the current best.
    2. The algorithm checks whether there could be any points on the other side of the splitting plane that are closer to the search point than the current best. In concept, this is done by intersecting the splitting hyperplane with a hypersphere around the search point that has a radius equal to the current nearest distance. Since the hyperplanes are all axis-aligned this is implemented as a simple comparison to see whether the difference between the splitting coordinate of the search point and current node is lesser than the distance (overall coordinates) from the search point to the current best.
      1. If the hypersphere crosses the plane, there could be nearer points on the other side of the plane, so the algorithm must move down the other branch of the tree from the current node looking for closer points, following the same recursive process as the entire search.
      2. If the hypersphere doesn't intersect the splitting plane, then the algorithm continues walking up the tree, and the entire branch on the other side of that node is eliminated.
  4. When the algorithm finishes this process for the root node, then the search is complete.

这是一个非常棘手的算法,我不想将其描述为面试问题!幸运的是,这里的一般情况比需要的更复杂,正如您在 post 中指出的那样。但我相信这种方法可能接近您的(错误的)样本答案试图描述的内容。

首先,我不会说 Point(x, y, distance) 是好的设计或封装。 distance 并不是真正的点的一部分,它可以从 xy 计算出来。在设计方面,我肯定会有一个 函数 ,即来自 Point 的静态方法或一个助手 class Points

double distance(Point a, Point b)

那么具体的问题,其实我很赞同你的方案,把数据放在一个数组里,把这个数组排序,然后先取出N。 该示例可能暗示的是堆排序实际上经常使用二叉树结构 inside 要排序的数组 here :

The heap is often placed in an array with the layout of a complete binary tree.

当然,如果到原点的距离没有存储在Point中,出于性能原因,它必须与相应的Point对象放在数组中,或任何信息这将允许从排序的距离(参考,索引)中获取 Point 对象,例如

List<Pair<Long, Point>> distancesToOrigin = new ArrayList<>();

Comparator<Pair<Long, Point>>

排序