Sequence number bug in CouchDB 2, or is there another way to compare sequence numbers?
I'm digging into CouchDB 2 and have found some unexpected ordering of sequence numbers. In one case, an early change in my _changes feed has the sequence number
99-g1AAAAI-eJyd0EsOgjAQBuAGiI-dN9C9LmrBwqzkJtrSNkgQV6z1JnoTvYneBEvbhA0aMU1mkj6-_NMSITTJfYFm2anOcsFT10mpTzyG-LxpmiL32eqoN8aEAcWE9dz_jPCFrnzrHGQchiFM4kSgaV0JqQ6VFF-AtAV2DggMgCEGxrNhQfatc3bOyDiKUalg2EBVoCu66KapazcUh41e69-GssjNIvcWWRokk2oNofwj0MNazy4QFURhGQ0J9LKI-SHPIBHEgiak51nxBhxnrRk
The last sequence number in my _changes feed, for the same database, is
228-g1AAAAJFeJyd0EkOgjAUBuAGTJCdN9AjlIKFruQm2jFAEFes9SZ6E72J3gQ7JW7QCGnyXtLhy-vfAgCWVSjAip96XglW-o5afRJQwNbDMDRVSOuj3ogQJRgiOnL_O8I2urKdd4B1KCRpkRcCxH0npKo7KX4ApQH2HogsAElOKOPTBjkY5-yd2DqKYqnItA91C13BRTdNXY0VWouRrV7JDOvmrLuxlLW4VAlJ5Qzr4aznJ2wskIIy-y9sh7wcYoMKLJKRXOACjTxr3uHcsBE
In a browser console, the following evaluates to false
'228-g1AAAAJFeJyd0EkOgjAUBuAGTJCdN9AjlIKFruQm2jFAEFes9SZ6E72J3gQ7JW7QCGnyXtLhy-vfAgCWVSjAip96XglW-o5afRJQwNbDMDRVSOuj3ogQJRgiOnL_O8I2urKdd4B1KCRpkRcCxH0npKo7KX4ApQH2HogsAElOKOPTBjkY5-yd2DqKYqnItA91C13BRTdNXY0VWouRrV7JDOvmrLuxlLW4VAlJ5Qzr4aznJ2wskIIy-y9sh7wcYoMKLJKRXOACjTxr3uHcsBE' > '99-g1AAAAI-eJyd0EsOgjAQBuAGiI-dN9C9LmrBwqzkJtrSNkgQV6z1JnoTvYneBEvbhA0aMU1mkj6-_NMSITTJfYFm2anOcsFT10mpTzyG-LxpmiL32eqoN8aEAcWE9dz_jPCFrnzrHGQchiFM4kSgaV0JqQ6VFF-AtAV2DggMgCEGxrNhQfatc3bOyDiKUalg2EBVoCu66KapazcUh41e69-GssjNIvcWWRokk2oNofwj0MNazy4QFURhGQ0J9LKI-SHPIBHEgiak51nxBhxnrRk'
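Is this a bug, or do I need to use some other method to compare sequence numbers?
Looking at the other sequence numbers in my _changes feed, they generally appear to be ordered as I expect, but in this case the ordering breaks when the leading number (e.g. 99) goes from 2 digits to 3 digits. Boiled down to a simple string comparison, you can see that '228' > '99' => false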
The following answer contains excerpts from an email thread with @rnewson. I hope it helps others understand sequence numbers in CouchDB 2. Thanks, Robert!
Background:
There's no easy way to compare them in 2.0 and no requirement for them
to be in order. They are not, in short, designed to be examined or
compared outside of couchdb; treat them opaquely.
The number on the front is the sum of the individual update sequences
encoded in the second part and exists only to trick older versions of
the couchdb replicator into making checkpoints.
The latter half of the sequence string is an encoded list of {node,
range, seq} tuples (where seq is the integer value you know from
pre-2.0 releases). When a sequence string is passed back in, as the
since= parameter, couchdb decodes this string and passes the
appropriate integer seq value to the individual shard.
All that said, in general the front number should increase. The full
strings themselves are not comparable, since there's no defined order
to the encoded list (so two strings could be generated that are
encoded differently but decode to the same list of tuples, just in a
different order).
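To make the point about the front number concrete, here is a minimal JavaScript sketch. The helper name `seqPrefix` is made up for this example, and the two sequence strings are shortened placeholders rather than real values. As the excerpt says, the prefix is at best a rough hint and should not be relied on for correctness.

```javascript
// Hypothetical helper: pull out the integer in front of the first "-".
// The part after the dash is opaque and is ignored here.
function seqPrefix(seq) {
  const dash = seq.indexOf('-');
  const head = dash === -1 ? seq : seq.slice(0, dash);
  const n = Number.parseInt(head, 10);
  if (Number.isNaN(n)) {
    throw new Error('no numeric prefix in sequence: ' + seq);
  }
  return n;
}

// Shortened placeholder values standing in for the full strings.
const earlier = '99-g1AAAAI';
const later = '228-g1AAAAJF';

// String comparison is lexicographic: '2' sorts before '9'.
const asStrings = later > earlier;
// Comparing the parsed prefixes gives the numeric answer.
const asNumbers = seqPrefix(later) > seqPrefix(earlier);

console.log(asStrings, asNumbers); // false true
```

This only explains why the console comparison in the question came out false; it is not a recommended way to order changes.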
Another aspect to this is that the changes feed is not totally
ordered. For a given shard it is totally ordered (a shard being
identical to a pre 2.0 database with an integer sequence), couchdb
doesn't shuffle that output (though correctness of replication would
be retained if it did). A clustered database is comprised of several
shards, though (the 'q' value, defaulting to 4 iirc). The clustered
changes feed combines those separate changes feeds into a single one,
but makes no effort to impose a total order over that. We don't do it
because it would be expensive and unnecessary.
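The following is a toy model of that merging, written only to illustrate the description above; it is not how CouchDB is implemented. The function name `mergeShardFeeds`, the document ids, and the arrival order are all invented for the example. It assumes each merged row carries the latest integer seq seen per shard and that the front number is their sum.

```javascript
// Toy model only: this is NOT CouchDB's implementation.
// Each shard has its own totally ordered integer sequence.
function mergeShardFeeds(arrivals, shardCount) {
  // Latest seq seen per shard; 0 means nothing seen yet.
  const latest = new Array(shardCount).fill(0);
  const rows = [];
  for (const { shard, seq, id } of arrivals) {
    latest[shard] = seq;
    // Front number = sum of the per-shard update sequences.
    const front = latest.reduce((a, b) => a + b, 0);
    rows.push({ id, shard, shardSeq: seq, front });
  }
  return rows;
}

// Hypothetical arrival order: rows from different shards interleave
// arbitrarily, but within one shard the seq only goes up.
const arrivals = [
  { shard: 1, seq: 1, id: 'doc-b' },
  { shard: 0, seq: 1, id: 'doc-a' },
  { shard: 1, seq: 2, id: 'doc-d' },
  { shard: 0, seq: 2, id: 'doc-c' },
];

const merged = mergeShardFeeds(arrivals, 2);
for (const row of merged) {
  console.log(row.front, row.id, 'shard', row.shard, 'seq', row.shardSeq);
}
```

In this model the front number goes up with every row, yet nothing says whether doc-a or doc-b was written first, which is the sense in which the merged feed has no total order.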
A solution if you need to listen to the _changes feed and later restart from where you left off:
The algorithm for correctly consuming the changes feed is:
1. read /dbname/_changes
2. process each row idempotently
3. periodically (every X seconds or every X rows) store the "seq" value of the last row you processed
If you ever crash, or if you weren't using continuous=true, you can do
this same procedure again but modified in step 1;
revised 1. read /dbname/_changes?since=X
where X is the value you saved in step 3. If you're not using
continuous mode then you could just record the "last_seq" value at the
end of consuming the non-continuous response. You run the risk of
reprocessing a lot more items, though.
With this scheme (which the replicator and all indexers follow), you
don't care if the results come out of order, you don't need to compare
any two seq values.
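The scheme above can be sketched roughly as follows for the non-continuous case. Everything the caller passes in (`baseUrl`, `fetchJson`, `processRow`, `loadCheckpoint`, `saveCheckpoint`, `checkpointEvery`) is a hypothetical name chosen for this example; the only CouchDB details assumed are the `since` query parameter and the `results`, `seq`, and `last_seq` fields of the _changes response.

```javascript
// Sketch of the checkpointed consumer described above.
// All callbacks are hypothetical and supplied by the caller.
async function consumeChanges({
  baseUrl,
  fetchJson,
  processRow,
  loadCheckpoint,
  saveCheckpoint,
  checkpointEvery = 100,
}) {
  // Step 1 (or "revised 1" when a checkpoint exists).
  const since = await loadCheckpoint(); // undefined on the first run
  const url = since === undefined
    ? `${baseUrl}/_changes`
    : `${baseUrl}/_changes?since=${encodeURIComponent(since)}`;
  const body = await fetchJson(url);

  let rowsSinceCheckpoint = 0;
  for (const row of body.results) {
    // Step 2: must be safe to run more than once for the same row.
    await processRow(row);
    rowsSinceCheckpoint += 1;
    // Step 3: store the seq as-is; it is never inspected or compared.
    if (rowsSinceCheckpoint >= checkpointEvery) {
      await saveCheckpoint(row.seq);
      rowsSinceCheckpoint = 0;
    }
  }
  // Non-continuous response: record last_seq once all rows are processed.
  await saveCheckpoint(body.last_seq);
  return body.last_seq;
}
```

In a browser or a recent Node.js, `fetchJson` could be something like `(url) => fetch(url).then((r) => r.json())`. Passing the callbacks in keeps the sketch independent of where the checkpoint is stored.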
You do need to ensure you can correctly process the same change
multiple times. For an example of that, consider the replicator, when
it sees a row from a changes feed it asks the target database if it
contains the _id and _rev values from that row. If it does, the
replicator moves on to the next row. If it doesn't, it tries to write
the document in that row to the target database. In the event of a
crash, and therefore a call to _changes with a seq value from before
processing that row, it will ask the target database if it has the
_id/_rev again, only this time the target will say yes.
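As an illustration of that replay behaviour, here is a small sketch with an in-memory stand-in for the target database. `FakeTarget` and `applyRow` are invented for this example and are not part of CouchDB or its replicator; a real target would be queried over HTTP. The row shape (`id`, `changes[].rev`) follows the _changes response format, and the rev and seq values are placeholders.

```javascript
// In-memory stand-in for a target database: maps _id -> Set of _rev.
// Purely illustrative.
class FakeTarget {
  constructor() {
    this.revs = new Map();
    this.writes = 0; // counts actual writes, to show replays are no-ops
  }
  has(id, rev) {
    const known = this.revs.get(id);
    return known !== undefined && known.has(rev);
  }
  write(id, rev) {
    if (!this.revs.has(id)) this.revs.set(id, new Set());
    this.revs.get(id).add(rev);
    this.writes += 1;
  }
}

// Handle one changes row the way the text describes the replicator:
// skip revisions the target already has, write the ones it lacks.
function applyRow(row, target) {
  let written = 0;
  for (const { rev } of row.changes) {
    if (target.has(row.id, rev)) continue;
    target.write(row.id, rev);
    written += 1;
  }
  return written;
}

const target = new FakeTarget();
const row = { seq: '1-placeholder', id: 'doc-a', changes: [{ rev: '1-abc' }] };

const firstPass = applyRow(row, target);  // target lacks the rev: write it
const secondPass = applyRow(row, target); // replay after a "crash": no-op

console.log(firstPass, secondPass, target.writes); // 1 0 1
```

Because the check is on _id and _rev rather than on the seq, it does not matter how many times or in what order the same row is seen.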