可以在单次传递中为内存结构构建一个 FULL JOIN(不使用 sql!)
Is possible in a single pass build a FULL JOIN for a in-memory structure (not using sql!)
我正在构建内存中的柱状关系引擎。为了提取我想要进行后期实现的值,我在其中收集 positions/indices 找到匹配项并在最后收集值。
现在实施联接,我看不出涵盖所有其他情况的通用联接算法是如何实现的。左、右和内部很容易,但 FULL/Outer 不。这是我使用嵌套循环的天真实现:
pub fn join(&self, compare:&BoolExpr) -> JoinPos
{
//Allocate at least for inner joins...
let total = cmp::max(self.left.len(), self.right.len());
let mut cols_left = Vec::with_capacity(total);
let mut cols_right = Vec::with_capacity(total);
let mut found = false;
while !self.left.eof() {
let left = self.left.tuple();
while !self.right.eof() {
let right = self.right.tuple();
if compare(&left, &right) {
//A match found. Record positions of both cursors
cols_left.push(self.left.pos() as isize);
cols_right.push(self.right.pos() as isize);
found = true;
}
self.right.next();
}
//Not found a single match at the right cursor..
if !found {
cols_left.push(self.left.pos() as isize);
cols_right.push(-1);
}
found = false;
self.left.next();
self.right.first();
}
JoinPos {
left:cols_left,
right:cols_right,
}
}
现在的问题是,我可以在没有另一遍的情况下找到左边的东西,而不是相反的东西:
Input:
L= [1, 2, 3]
R= [2, 3, 4]
Result. It capture the positions that match. -1 if not found
L | R
======
1 -1
2 1
3 2
-1 3
感谢@philipxy,我有了一个可行的解决方案:
pub fn join(&self, compare:&BoolExpr) -> JoinPos
{
let total = cmp::max(self.left.len(), self.right.len());
let mut right_not_founds = HashSet::new();
let mut cols_left = Vec::with_capacity(total);
let mut cols_right = Vec::with_capacity(total);
let mut found = false;
let mut first = true;
while !self.left.eof() {
let left = self.left.tuple(&self.cols_left);
while !self.right.eof() {
let right = self.right.tuple(&self.cols_right);
if first {
right_not_founds.insert(self.right.pos());
}
if compare(&left, &right) {
cols_left.push(self.left.pos() as isize);
cols_right.push(self.right.pos() as isize);
right_not_founds.remove(&self.right.pos());
found = true;
}
self.right.next();
}
if !found {
cols_left.push(self.left.pos() as isize);
cols_right.push(-1);
}
found = false;
first = false;
self.left.next();
self.right.first();
}
for pos in right_not_founds {
cols_left.push(-1);
cols_right.push(pos as isize);
}
JoinPos {
left:cols_left,
right:cols_right,
}
}
完全联接 returns 内部联接元组并集不匹配的左 table 由空值扩展的元组不匹配的右联合 table 由空值扩展的元组。
目前您的代码输出左连接元组。外循环的每次迭代都会输出更多位于内连接中的元组,或者输出与任何右 table 元组不匹配的左 table 元组的空扩展。要输出完整的连接元组,您还必须输出与任何左 table 元组不匹配的每个右 table 元组的空扩展。您可以按如下方式执行此操作: 在循环之前定义一个设置变量。它最终将包含与任何左 table 元组不匹配的右 table 元组的所有 positions/indices。就在 if compare
之前,如果它是外循环的第一次迭代,则将右元组 position/index 插入集合中。在 if compare
的嵌套块内,从集合中删除右元组 position/index。
我正在构建内存中的柱状关系引擎。为了提取我想要进行后期实现的值,我在其中收集 positions/indices 找到匹配项并在最后收集值。
现在实施联接,我看不出涵盖所有其他情况的通用联接算法是如何实现的。左、右和内部很容易,但 FULL/Outer 不。这是我使用嵌套循环的天真实现:
pub fn join(&self, compare:&BoolExpr) -> JoinPos
{
//Allocate at least for inner joins...
let total = cmp::max(self.left.len(), self.right.len());
let mut cols_left = Vec::with_capacity(total);
let mut cols_right = Vec::with_capacity(total);
let mut found = false;
while !self.left.eof() {
let left = self.left.tuple();
while !self.right.eof() {
let right = self.right.tuple();
if compare(&left, &right) {
//A match found. Record positions of both cursors
cols_left.push(self.left.pos() as isize);
cols_right.push(self.right.pos() as isize);
found = true;
}
self.right.next();
}
//Not found a single match at the right cursor..
if !found {
cols_left.push(self.left.pos() as isize);
cols_right.push(-1);
}
found = false;
self.left.next();
self.right.first();
}
JoinPos {
left:cols_left,
right:cols_right,
}
}
现在的问题是,我可以在没有另一遍的情况下找到左边的东西,而不是相反的东西:
Input:
L= [1, 2, 3]
R= [2, 3, 4]
Result. It capture the positions that match. -1 if not found
L | R
======
1 -1
2 1
3 2
-1 3
感谢@philipxy,我有了一个可行的解决方案:
pub fn join(&self, compare:&BoolExpr) -> JoinPos
{
let total = cmp::max(self.left.len(), self.right.len());
let mut right_not_founds = HashSet::new();
let mut cols_left = Vec::with_capacity(total);
let mut cols_right = Vec::with_capacity(total);
let mut found = false;
let mut first = true;
while !self.left.eof() {
let left = self.left.tuple(&self.cols_left);
while !self.right.eof() {
let right = self.right.tuple(&self.cols_right);
if first {
right_not_founds.insert(self.right.pos());
}
if compare(&left, &right) {
cols_left.push(self.left.pos() as isize);
cols_right.push(self.right.pos() as isize);
right_not_founds.remove(&self.right.pos());
found = true;
}
self.right.next();
}
if !found {
cols_left.push(self.left.pos() as isize);
cols_right.push(-1);
}
found = false;
first = false;
self.left.next();
self.right.first();
}
for pos in right_not_founds {
cols_left.push(-1);
cols_right.push(pos as isize);
}
JoinPos {
left:cols_left,
right:cols_right,
}
}
完全联接 returns 内部联接元组并集不匹配的左 table 由空值扩展的元组不匹配的右联合 table 由空值扩展的元组。
目前您的代码输出左连接元组。外循环的每次迭代都会输出更多位于内连接中的元组,或者输出与任何右 table 元组不匹配的左 table 元组的空扩展。要输出完整的连接元组,您还必须输出与任何左 table 元组不匹配的每个右 table 元组的空扩展。您可以按如下方式执行此操作: 在循环之前定义一个设置变量。它最终将包含与任何左 table 元组不匹配的右 table 元组的所有 positions/indices。就在 if compare
之前,如果它是外循环的第一次迭代,则将右元组 position/index 插入集合中。在 if compare
的嵌套块内,从集合中删除右元组 position/index。