SSIS with millions of rows to compare between source and target

Question

I am trying to understand SSIS and have a few questions about it.

I want to compare 2 tables: one is in SQL Server and the other is in Oracle.

Both tables have the same schema, as shown below:

SQL Server:
Id      Amount
1       100
2       200
3       300


Oracle:
Id      Amount
3       3000
2       2000
1       1000

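These are just sample records. In reality I have 24 million records in total, 12 million in the source and 12 million in the target, in random order.

Task: I am trying to compare the source data with the target data. When joining on the Id column, there is always a 1-to-1 match between source and target. I want to compare the Amount column and store the mismatched records in a SQL Server database. As far as I know, the Lookup transformation would work in this scenario.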

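But I have some doubts:

1) If I fire select * from both the source and the target, where will the 24 million records be held? In memory?

2) Can I get an out-of-memory exception in this case?

3) Will Lookup work, given that the result sets from the source and the target are not in the same order? Will it load all of the source data and then match records one by one against the target, without loading the whole target data?

4) How does SSIS handle comparing millions of records between source and target?

Can anybody help me clear up these doubts?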

Answer 1

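If you do this with a Lookup, neither rowset is stored completely in memory unless you use full cache. If you do use the cache, the target's data is held in memory, and of course you could get a memory exception if you do not have enough memory available.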

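Without a cache, a Lookup is a terrible idea, because for every row in the source data you query the target data. You will therefore issue 12 million individual queries against the target before you are done. This is the worst-performing option.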
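To make the trade-off between the two cache behaviours concrete, here is a minimal Python sketch. It is not SSIS code: `lookup_full_cache`, `lookup_no_cache`, and the `query_target` callback are hypothetical names that only imitate what the Lookup does with the reference (target) data, and it assumes every source Id exists in the target, as stated in the question.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Row = Tuple[int, int]            # (Id, Amount)
Mismatch = Tuple[int, int, int]  # (Id, source Amount, target Amount)


def lookup_full_cache(source: Iterable[Row], target: Iterable[Row]) -> List[Mismatch]:
    """Imitates full cache: read the whole reference set once, then probe it in memory."""
    cache: Dict[int, int] = dict(target)  # memory use grows with the size of the target
    # Assumes a 1-to-1 match; a missing Id would raise KeyError here.
    return [(id_, amount, cache[id_]) for id_, amount in source if cache[id_] != amount]


def lookup_no_cache(source: Iterable[Row], query_target: Callable[[int], int]) -> List[Mismatch]:
    """Imitates no cache: one round trip to the target for every source row."""
    mismatches: List[Mismatch] = []
    for id_, amount in source:
        target_amount = query_target(id_)  # 12 million rows means 12 million calls
        if target_amount != amount:
            mismatches.append((id_, amount, target_amount))
    return mismatches
```

In both variants the order of the rows does not matter, because each source row is matched by key. The difference is only where the cost lands: memory for the cached version, round trips for the uncached one.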

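A Merge Join is faster because both inputs are pre-sorted on the matching key, so matching is much quicker. In addition, neither dataset needs to be held in memory. Rows flow freely without waiting for an entire dataset to load.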

Here is a comparison of Lookup and Merge Join.

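The fastest option is to load the target data directly into a staging table on the same server as the source data and to index that table on the join key. You can then do the comparison in SQL, joining on the indexed column, which will give you the best performance.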
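A rough sketch of the staging-table approach follows. It uses Python's built-in `sqlite3` with an in-memory database purely as a stand-in for SQL Server, and the table and index names (`SourceAmounts`, `StagingTargetAmounts`, `IX_Staging_Id`) are made up for illustration. On a real server the bulk load would be done by an SSIS data flow and the join would be written in T-SQL, but the shape of the query is the same.

```python
import sqlite3
from typing import Iterable, List, Tuple

Row = Tuple[int, int]  # (Id, Amount)


def find_mismatches(source_rows: Iterable[Row], target_rows: Iterable[Row]) -> List[Tuple[int, int, int]]:
    """Load both sides into one database, index the join key, and compare in SQL."""
    conn = sqlite3.connect(":memory:")  # stand-in for the SQL Server database
    conn.execute("CREATE TABLE SourceAmounts (Id INTEGER PRIMARY KEY, Amount INTEGER)")
    conn.execute("CREATE TABLE StagingTargetAmounts (Id INTEGER, Amount INTEGER)")
    conn.executemany("INSERT INTO SourceAmounts VALUES (?, ?)", source_rows)
    # Bulk-load the staging table first, then build the index on the join key.
    conn.executemany("INSERT INTO StagingTargetAmounts VALUES (?, ?)", target_rows)
    conn.execute("CREATE INDEX IX_Staging_Id ON StagingTargetAmounts (Id)")
    rows = conn.execute(
        """
        SELECT s.Id, s.Amount, t.Amount
        FROM SourceAmounts AS s
        JOIN StagingTargetAmounts AS t ON t.Id = s.Id
        WHERE s.Amount <> t.Amount
        ORDER BY s.Id
        """
    ).fetchall()
    conn.close()
    return rows
```

With this layout the database engine chooses the join strategy and manages memory itself, so the package only has to copy rows into the staging table and write the query result to the mismatch table.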

Answer 2

In addition to Tab's answer, the OP asked 'how does SSIS performs millions of records comparision from source to target without loading whole data set'

Answer:

Remember, Merge Join only accepts sorted inputs.

Merge is going to walk through two sets in the order that you gave in your input or using the Sort transformation. So, it loads one record from one input and one record from the second input. If the keys match, it will output the row with information from both inputs. The advantage is that SSIS only needs to retain a couple rows in memory.

What if Microsoft decided that there is no requirement for sorting? Then in order for the Merge to work is that it would load all of the rows from one input into memory and then the Merge would look up the row in memory. That means a large amount of memory would be needed.

Source: MSDN
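The walk described in the quote can be sketched in a few lines of Python. This is only an illustration of the idea, not how SSIS is implemented: `merge_compare` is a hypothetical name, both inputs are assumed to be already sorted by Id, and rows are assumed to be (Id, Amount) pairs.

```python
from typing import Iterable, Iterator, Optional, Tuple

Row = Tuple[int, int]  # (Id, Amount)


def merge_compare(source: Iterable[Row], target: Iterable[Row]) -> Iterator[Tuple[int, int, int]]:
    """Stream two inputs sorted by Id and yield (Id, source Amount, target Amount) mismatches."""
    src, tgt = iter(source), iter(target)
    s: Optional[Row] = next(src, None)
    t: Optional[Row] = next(tgt, None)
    # Only the current row from each input is held at any time.
    while s is not None and t is not None:
        if s[0] == t[0]:
            if s[1] != t[1]:
                yield (s[0], s[1], t[1])
            s, t = next(src, None), next(tgt, None)
        elif s[0] < t[0]:
            s = next(src, None)  # source Id has no partner yet, advance the source
        else:
            t = next(tgt, None)  # target Id has no partner yet, advance the target
```

For example, `list(merge_compare([(1, 100), (2, 200)], [(1, 1000), (2, 200)]))` gives `[(1, 100, 1000)]`. In a package, the sorted order would usually come from an `ORDER BY Id` in each source query, with the source outputs marked as sorted, rather than from a Sort transformation, since sorting inside the data flow has to buffer all rows before it can emit any.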

sql-server

ssis

etl

data-comparison