可以在单次传递中为内存结构构建一个 FULL JOIN(不使用 sql!)

Is possible in a single pass build a FULL JOIN for a in-memory structure (not using sql!)

我正在构建内存中的柱状关系引擎。为了提取我想要进行后期实现的值,我在其中收集 positions/indices 找到匹配项并在最后收集值。

现在实施联接,我看不出涵盖所有其他情况的通用联接算法是如何实现的。左、右和内部很容易,但 FULL/Outer 不。这是我使用嵌套循环的天真实现:

pub fn join(&self, compare:&BoolExpr) -> JoinPos
{
    //Allocate at least for inner joins...
    let total = cmp::max(self.left.len(), self.right.len());

    let mut cols_left  = Vec::with_capacity(total);
    let mut cols_right = Vec::with_capacity(total);
    let mut found = false;

    while !self.left.eof() {
        let left = self.left.tuple();
        while !self.right.eof() {
            let right = self.right.tuple();
            if compare(&left, &right) {
                //A match found. Record positions of both cursors
                cols_left.push(self.left.pos() as isize);
                cols_right.push(self.right.pos() as isize);
                found = true;
            }
            self.right.next();
        }
        //Not found a single match at the right cursor..
        if !found {
            cols_left.push(self.left.pos() as isize);
            cols_right.push(-1);
        }
        found = false;
        self.left.next();
        self.right.first();
    }

    JoinPos {
        left:cols_left,
        right:cols_right,
    }
}

现在的问题是,我可以在没有另一遍的情况下找到左边的东西,而不是相反的东西:

Input:
L= [1, 2, 3]
R= [2, 3, 4]

Result. It capture the positions that match. -1 if not found
L | R
======
1   -1
2   1
3   2
-1  3

感谢@philipxy,我有了一个可行的解决方案:

pub fn join(&self, compare:&BoolExpr) -> JoinPos
{
    let total = cmp::max(self.left.len(), self.right.len());

    let mut right_not_founds = HashSet::new();
    let mut cols_left  = Vec::with_capacity(total);
    let mut cols_right = Vec::with_capacity(total);
    let mut found = false;
    let mut first = true;

    while !self.left.eof() {
        let left = self.left.tuple(&self.cols_left);
        while !self.right.eof() {
            let right = self.right.tuple(&self.cols_right);
            if first {
                right_not_founds.insert(self.right.pos());
            }

            if compare(&left, &right) {
                cols_left.push(self.left.pos() as isize);
                cols_right.push(self.right.pos() as isize);
                right_not_founds.remove(&self.right.pos());

                found = true;
            }
            self.right.next();
        }
        if !found {
            cols_left.push(self.left.pos() as isize);
            cols_right.push(-1);
        }
        found = false;
        first = false;
        self.left.next();
        self.right.first();
    }

    for pos in right_not_founds {
        cols_left.push(-1);
        cols_right.push(pos as isize);
    }

    JoinPos {
        left:cols_left,
        right:cols_right,
    }
}

完全联接 returns 内部联接元组并集不匹配的左 table 由空值扩展的元组不匹配的右联合 table 由空值扩展的元组。

目前您的代码输出左连接元组。外循环的每次迭代都会输出更多位于内连接中的元组,或者输出与任何右 table 元组不匹配的左 table 元组的空扩展。要输出完整的连接元组,您还必须输出与任何左 table 元组不匹配的每个右 table 元组的空扩展。您可以按如下方式执行此操作: 在循环之前定义一个设置变量。它最终将包含与任何左 table 元组不匹配的右 table 元组的所有 positions/indices。就在 if compare 之前,如果它是外循环的第一次迭代,则将右元组 position/index 插入集合中。在 if compare 的嵌套块内,从集合中删除右元组 position/index。