Kafka Connect JDBC vs Debezium CDC

What are the differences between the JDBC Connector and the Debezium SQL Server CDC Connector (or any other relational database connector), and when should I choose one over the other, when looking for a solution to sync between two relational databases?

Not sure if this discussion should be about CDC vs the JDBC Connector rather than the Debezium SQL Server CDC Connector specifically, or even just Debezium in general; expecting later edits depending on the answers given (though my case is about a SQL Server sink).

Sharing with you my research on this topic, which led me to this question (posted as an answer):

This explanation focuses mainly on the differences between the Debezium SQL Server CDC Connector and the JDBC Connector, with a more general interpretation of Debezium and CDC.

tl;dr - scroll down :)


Debezium

Debezium is used only as a source connector, and it records all row-level changes.
The Debezium Documentation says:

Debezium is a set of distributed services to capture changes in your databases so that your applications can see those changes and respond to them. Debezium records all row-level changes within each database table in a change event stream, and applications simply read these streams to see the change events in the same order in which they occurred.

The Debezium connector for SQL Server first records a snapshot of the database, and then sends records of row-level changes to Kafka, each table to a different Kafka topic.
The Debezium Connector for SQL Server Documentation says:

Debezium’s SQL Server Connector can monitor and record the row-level changes in the schemas of a SQL Server database.

The first time it connects to a SQL Server database/cluster, it reads a consistent snapshot of all of the schemas. When that snapshot is complete, the connector continuously streams the changes that were committed to SQL Server and generates corresponding insert, update and delete events. All of the events for each table are recorded in a separate Kafka topic, where they can be easily consumed by applications and services.
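For a concrete feel of the source side, here is a minimal sketch of how such a connector might be registered with Kafka Connect. This is my own illustration, not from the docs quoted above: host, database, and table names are placeholders, and exact property names vary between Debezium releases (e.g. older versions use database.server.name instead of topic.prefix):

```json
{
  "name": "sqlserver-source-connector",
  "config": {
    "connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
    "database.hostname": "sqlserver",
    "database.port": "1433",
    "database.user": "debezium",
    "database.password": "********",
    "database.names": "inventory",
    "topic.prefix": "server1",
    "table.include.list": "dbo.customers",
    "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
    "schema.history.internal.kafka.topic": "schema-changes.inventory"
  }
}
```

With this in place, changes to dbo.customers would be streamed to a topic such as server1.inventory.dbo.customers, following Debezium's fully qualified topic naming discussed further below.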


Kafka Connect JDBC

Kafka Connect JDBC can be used either as a source or as a sink connector to Kafka, and supports any database with a JDBC driver.
The JDBC Connector Documentation says:

You can use the Kafka Connect JDBC source connector to import data from any relational database with a JDBC driver into Apache Kafka® topics. You can use the JDBC sink connector to export data from Kafka topics to any relational database with a JDBC driver. The JDBC connector supports a wide variety of databases without requiring custom code for each one.

They have some specifications about installing on Microsoft SQL Server, which I consider irrelevant to this discussion.

So, since the JDBC Connector supports both source and sink, while Debezium supports only source (no sink), we understand that in order to write data from Kafka into a database over a JDBC driver (sink), the JDBC Connector is the way to go (including for SQL Server).
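As an illustration (my own sketch, not from any of the quoted docs), a sink from a Kafka topic into SQL Server could look roughly like this; connection details, topic, and key column are placeholders:

```json
{
  "name": "jdbc-sink-connector",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "connection.url": "jdbc:sqlserver://sqlserver:1433;databaseName=inventory",
    "connection.user": "connect",
    "connection.password": "********",
    "topics": "customers",
    "insert.mode": "upsert",
    "pk.mode": "record_key",
    "pk.fields": "id",
    "auto.create": "true"
  }
}
```

Note insert.mode=upsert: the sink simply turns each message into an insert/update against the target table, which matters for the comparison below.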

Now the comparison should be narrowed down to the source side only.
At first glance, the JDBC Source Connector Documentation doesn't say much:

Data is loaded by periodically executing a SQL query and creating an output record for each row in the result set. By default, all tables in a database are copied, each to its own output topic. The database is monitored for new or deleted tables and adapts automatically. When copying data from a table, the connector can load only new or modified rows by specifying which columns should be used to detect new or modified data.
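In practice, that query-based polling is configured with a mode and the columns used to detect changes. A minimal sketch (my own, with placeholder connection details and column names):

```json
{
  "name": "jdbc-source-connector",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:sqlserver://sqlserver:1433;databaseName=inventory",
    "connection.user": "connect",
    "connection.password": "********",
    "table.whitelist": "customers",
    "mode": "timestamp+incrementing",
    "incrementing.column.name": "id",
    "timestamp.column.name": "updated_at",
    "poll.interval.ms": "5000",
    "topic.prefix": "jdbc-"
  }
}
```

Here mode=timestamp+incrementing is exactly the "which columns should be used to detect new or modified data" part of the quote above, and poll.interval.ms is the fixed polling interval discussed next.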


Searching further to understand their differences, this Debezium blog, which uses the Debezium MySQL connector as a source and the JDBC connector as a sink, contains an explanation of the difference between the two. In short, Debezium provides records holding much more information about the database changes, while the JDBC Connector provides records more focused on converting those changes into simple insert/upsert commands:

The Debezium MySQL Connector was designed to specifically capture database changes and provide as much information as possible about those events beyond just the new state of each row. Meanwhile, the Confluent JDBC Sink Connector was designed to simply convert each message into a database insert/upsert based upon the structure of the message. So, the two connectors have different structures for the messages, but they also use different topic naming conventions and behavior of representing deleted records.

Moreover, they use different topic naming conventions and represent deleted records differently:

Debezium uses fully qualified naming for target topics representing each table it manages. The naming follows the pattern [logical-name].[database-name].[table-name]. Kafka Connect JDBC Connector works with simple names [table-name].

...

When the Debezium connector detects a row is deleted, it creates two event messages: a delete event and a tombstone message. The delete message has an envelope with the state of the deleted row in the before field, and an after field that is null. The tombstone message contains same key as the delete message, but the entire message value is null, and Kafka’s log compaction utilizes this to know that it can remove any earlier messages with the same key. A number of sink connectors, including the Confluent’s JDBC Sink Connector, are not expecting these messages and will instead fail if they see either kind of message.
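To illustrate the delete handling, here is a simplified sketch of a Debezium delete event value (an invented placeholder row; the real envelope also carries a schema and a source metadata block, both omitted here):

```json
{
  "before": { "id": 1004, "first_name": "Anne", "last_name": "Kretchmar" },
  "after": null,
  "op": "d",
  "ts_ms": 1559033904863
}
```

The tombstone that follows is a message with the same key and a null value, which Kafka's log compaction understands but which, per the quote above, a plain JDBC sink does not expect and will fail on.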

This Confluent blog explains in detail how the CDC and JDBC connectors work: the JDBC connector executes a query against the source database at a fixed interval, which is not a very scalable solution, whereas CDC streams from the database transaction log at a much higher frequency:

The connector works by executing a query, over JDBC, against the source database. It does this to pull in all rows (bulk) or those that changed since previously (incremental). This query is executed at the interval defined in poll.interval.ms. Depending on the volumes of data involved, the physical database design (indexing, etc.), and other workload on the database, this may not prove to be the most scalable option.

...

Done properly, CDC basically enables you to stream every single event from a database into Kafka. Broadly put, relational databases use a transaction log (also called a binlog or redo log depending on DB flavour), to which every event in the database is written. Update a row, insert a row, delete a row – it all goes to the database’s transaction log. CDC tools generally work by utilising this transaction log to extract at very low latency and low impact the events that are occurring on the database (or a schema/table within it).

This blog also discusses the differences between CDC and the JDBC Connector, mainly that the JDBC Connector does not support syncing deleted records and is suitable for prototyping, while CDC suits more mature systems:

The JDBC Connector cannot fetch deleted rows. Because, how do you query for data that doesn’t exist?

...

My general steer on CDC vs JDBC is that JDBC is great for prototyping, and fine low-volume workloads. Things to consider if using the JDBC connector:

- Doesn’t give true CDC (capture delete records, want before/after record versions)
- Latency in detecting new events
- Impact of polling the source database continually (and balancing this with the desired latency)
- Unless you’re doing a bulk pull from a table, you need to have an ID and/or timestamp that you can use to spot new records. If you don’t own the schema, this becomes a problem.


tl;dr Conclusion

The main differences between Debezium and the JDBC Connector are:

  1. Debezium is used only as a Kafka source, while the JDBC Connector can be used as both a Kafka source and sink.

And as a source:

  1. The JDBC Connector does not support syncing deleted records, while Debezium does.
  2. The JDBC Connector queries the database at a fixed interval, which is not a very scalable solution, whereas CDC, streaming from the database transaction log, has a much higher frequency.
  3. Debezium provides records with more information about the database changes, while the JDBC Connector provides records more focused on converting the database changes into simple insert/upsert commands.
  4. Different topic naming conventions.

Simply put, CDC is log-based streaming, while the Kafka Connect JDBC source connector is query-based streaming. :)

With the JDBC Connector you cannot capture DDL changes such as new tables or columns. With the Debezium connector you can track data structure changes, so you can also adapt your sink connector accordingly if needed.