如何在Python中生成一个按时间排序的uid？

Question

这可能吗？我听说 Cassandra 有类似的东西：https://datastax.github.io/python-driver/api/cassandra/util.html

我一直在使用 ISO timestamp 与 uuid4 连接，但结果太大（58 个字符）并且可能有点矫枉过正。

保留序号在我的上下文中不起作用 (DynamoDB NoSQL)

值得注意的是，对于我的应用程序，在 batch/same 秒内创建的项目是否随机排列并不重要，只要 uid 不折叠即可。

我对最大长度没有具体限制，理想情况下我希望看到不同长度的碰撞机会不同，但它需要小于58（我的原始尝试）

这是与 DynamoDB（NoSQL 数据库）一起用作排序键

Answer 1

在您链接的文章中，cassandra.util.uuid_from_time(time_arg, node=None, clock_seq=None)[source] 似乎正是您要查找的内容。

def uuid_from_time(time_arg, node=None, clock_seq=None):
    """
    Converts a datetime or timestamp to a type 1 :class:`uuid.UUID`.

    :param time_arg:
      The time to use for the timestamp portion of the UUID.
      This can either be a :class:`datetime` object or a timestamp
      in seconds (as returned from :meth:`time.time()`).
    :type datetime: :class:`datetime` or timestamp

    :param node:
      None integer for the UUID (up to 48 bits). If not specified, this
      field is randomized.
    :type node: long

    :param clock_seq:
      Clock sequence field for the UUID (up to 14 bits). If not specified,
      a random sequence is generated.
    :type clock_seq: int

    :rtype: :class:`uuid.UUID`

    """
    if hasattr(time_arg, 'utctimetuple'):
        seconds = int(calendar.timegm(time_arg.utctimetuple()))
        microseconds = (seconds * 1e6) + time_arg.time().microsecond
    else:
        microseconds = int(time_arg * 1e6)

    # 0x01b21dd213814000 is the number of 100-ns intervals between the
    # UUID epoch 1582-10-15 00:00:00 and the Unix epoch 1970-01-01 00:00:00.
    intervals = int(microseconds * 10) + 0x01b21dd213814000

    time_low = intervals & 0xffffffff
    time_mid = (intervals >> 32) & 0xffff
    time_hi_version = (intervals >> 48) & 0x0fff

    if clock_seq is None:
        clock_seq = random.getrandbits(14)
    else:
        if clock_seq > 0x3fff:
            raise ValueError('clock_seq is out of range (need a 14-bit value)')

    clock_seq_low = clock_seq & 0xff
    clock_seq_hi_variant = 0x80 | ((clock_seq >> 8) & 0x3f)

    if node is None:
        node = random.getrandbits(48)

    return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                             clock_seq_hi_variant, clock_seq_low, node), version=1)

没有任何 Cassandra 特定于 Type 1 UUID...

Answer 2

为什么 uuid.uuid1 不是连续的

uuid.uuid1(node=None, clock_seq=None) 实际上是：

60 位时间戳（表示 1582-10-15 00:00:00 之后的 100 ns 间隔数）
14 位 "clock sequence"
48位的"Node info"（从网卡的mac-address或主机名或RNG生成）。

如果您不提供任何参数，则会调用系统函数生成 uuid。在那种情况下：

不清楚 "clock sequence" 是连续的还是随机的。
尚不清楚在多个进程中使用是否安全（clock_seq可以在不同进程中重复使用吗？）。在 Python 3.7 this info is now available.

如果您提供 clock_seq 或 node，则 "pure python implementation is used"。在这种情况下，即使 "fixed value" for clock_seq:

即使在线程执行中，当前进程中的所有调用的时间戳部分也是 guaranteed to be sequential。
clock_seq部分随机生成。但这不再重要，因为时间戳是连续且唯一的。
它对多个进程不安全（如果在 "same 100-ns time interval" 期间调用 uuid1 使用相同 clock_seq, node 的进程可能会 return 冲突值）

重用的解决方案`uuid.uuid1`

很容易看出，您可以通过提供 clock_seq 或 node 参数（使用 python 实现）使 uuid1 顺序化。

import time

from uuid import uuid1, getnode

_my_clock_seq = getrandbits(14)
_my_node = getnode()


def sequential_uuid(node=None):
    return uuid1(node=node, clock_seq=_my_clock_seq)
    # .hex attribute of this value is 32-characters long string


def alt_sequential_uuid(clock_seq=None):
    return uuid1(node=_my_node, clock_seq=clock_seq)


if __name__ == '__main__':
    from itertools import count
    old_n = uuid1()  # "Native"
    old_s = sequential_uuid()  # Sequential

    native_conflict_index = None

    t_0 = time.time()

    for x in count():
        new_n = uuid1()
        new_s = sequential_uuid()

        if old_n > new_n and not native_conflict_index:
            native_conflict_index = x

        if old_s >= new_s:
            print("OOops: non-sequential results for `sequential_uuid()`")
            break

        if (x >= 10*0x3fff and time.time() - t_0 > 30) or (native_conflict_index and x > 2*native_conflict_index):
            print('No issues for `sequential_uuid()`')
            break

        old_n = new_n
        old_s = new_s

    print(f'Conflicts for `uuid.uuid1()`: {bool(native_conflict_index)}')

多进程问题

但是如果你是运行同一台机器上的一些并行进程，那么：

node 默认为 uuid.get_node() 对于所有进程都是一样的；
clock_seq 有很小的机会在某些过程中相同（1/16384 的机会）

这可能会导致冲突！这是使用的普遍关注 uuid.uuid1 在同一台机器上的并行进程中，除非您可以从 Python3.7.

访问 SafeUUID

如果您确保还为运行此代码的每个并行进程将 node 设置为唯一值，则不应发生冲突。

即使您使用的是 SafeUUID，并设置了唯一性 node，如果它们是在不同进程中生成的，仍然可能有 non-sequential（但唯一）ID。

如果一些 lock-related 开销是可以接受的，那么你可以将 clock_seq 存储在一些外部原子存储中（例如在 "locked" 文件中）并在每次调用时增加它：这允许在所有并行进程上为 node 设置相同的值，并且也会使 id-s 顺序。对于所有并行进程都是使用 multiprocessing 创建的子进程的情况：clock_seq 可以是 "shared" 使用 multiprocessing.Value

因此你要时刻记住：

如果你是运行同一台机器上的多个进程，那么你必须：
- 确保 node 的唯一性。此解决方案的问题：您无法确定在相同的 100 ns 间隔内生成来自不同进程的顺序 ID。但这是非常 "light" 在进程启动时执行一次的操作，并通过以下方式实现： "adding" 默认节点的某些内容，例如int(time.time()*1e9) - 0x118494406d1cc000，或者通过从 machine-level 原子数据库中添加一些计数器。
- 确保"machine-level atomic clock_seq"和一台机器上所有进程相同node。这样你就会有一些 "locking" clock_seq 的开销，但是 id-s 保证是连续的，即使在相同的 100 ns 间隔期间在不同的进程中生成（除非你正在调用 uuid来自同一进程中的多个线程）。
对于不同机器上的进程：
- 要么你必须使用一些"global counter service";
- 或者不可能在相同的 100 纳秒间隔内在不同机器上生成顺序 ID。

减小 id 的大小

生成 UUID 的一般方法是 quite simple，因此很容易从头开始实现类似的东西，例如 node_info 部分使用更少的位：

import time
from random import getrandbits

_my_clock_seq = getrandbits(14)
_last_timestamp_part = 0
_used_clock_seq = 0


timestamp_multiplier = 1e7  # I'd recommend to use this value

# Next values are enough up to year 2116:
if timestamp_multiplier == 1e9:
    time_bits = 62  # Up to year 2116, also reduces chances for non-sequential id-s generated in different processes
elif timestamp_multiplier == 1e8:
    time_bits = 60  # up to year 2335
elif timestamp_multiplier == 1e7:
    time_bits = 56  # Up to year 2198.
else:
    raise ValueError('Please calculate and set time_bits')

time_mask = 2**time_bits - 1

seq_bits = 16
seq_mask = 2**seq_bits - 1

node_bits = 12
node_mask = 2**node_bits - 1

max_hex_len = len(hex(2**(node_bits+seq_bits+time_bits) - 1)) - 2  # 21

_default_node_number = getrandbits(node_bits)  # or `uuid.getnode() & node_mask`


def sequential_uuid(node_number=None):
    """Return 21-characters long hex string that is sequential and unique for each call in current process.

    Results from different processes may "overlap" but are guaranteed to
    be unique if `node_number` is different in each process.

    """
    global _my_clock_seq
    global _last_timestamp_part
    global _used_clock_seq
    if node_number is None:
        node_number = _default_node_number
    if not 0 <= node_number <= node_mask:
        raise ValueError("Node number out of range")

    timestamp_part = int(time.time() * timestamp_multiplier) & time_mask
    _my_clock_seq = (_my_clock_seq + 1) & seq_mask

    if _last_timestamp_part >= timestamp_part:
        timestamp_part = _last_timestamp_part
        if _used_clock_seq == _my_clock_seq:
            timestamp_part = (timestamp_part + 1) & time_mask
    else:
        _used_clock_seq = _my_clock_seq

    _last_timestamp_part = timestamp_part

    return hex(
        (timestamp_part << (node_bits+seq_bits))
        |
        (_my_clock_seq << (node_bits))
        |
        node_number
    )[2:]

备注：

也许在数据库中简单地存储整数值（不是hex-string）会更好
如果您将其存储为 text/char，那么最好将整数转换为 base64 字符串，而不是将其转换为 hex-string。这样它会更短（21 个字符 hex-string → 16 个字符 b64 编码字符串）：

from base64 import b64encode

total_bits = time_bits+seq_bits+node_bits
total_bytes = total_bits // 8 + 1 * bool(total_bits % 8)

def int_to_b64(int_value):
    return b64encode(int_value.to_bytes(total_bytes, 'big'))

碰撞几率

单进程：不可能发生冲突
多个进程手动设置 unique clock_seq or unique node 在每个进程中：不可能发生冲突
多个进程随机设置node（48位，时间"fixed"）：
- 在多个进程中发生 node 冲突的几率：
  - 10000 个进程中有 2 个：~0.000018%
  - 100000 个进程中有 2 个：0.0018%
- 在 2 个进程中每秒有一次 id 与 "colliding" 发生冲突的几率node:
  - 对于 "timestamp" 100 ns 的间隔（uuid.uuid1 的默认值，在我的代码中 timestamp_multiplier == 1e7）：与 3.72e-19 * avg_call_frequency²[ 成正比=58=]
  - 对于 "timestamp" 10 ns 的间隔 (timestamp_multiplier == 1e8)：与 3.72e-21 * avg_call_frequency²

Answer 3

您应该能够以 32 位对 135 年的时间范围内精确到秒的时间戳进行编码。这将只需要 8 个字符以十六进制表示。添加到 uuid 的十六进制表示形式（32 个十六进制字符），将仅包含 40 个十六进制字符。

以这种方式编码时间戳需要您选择一个基准年（例如 2000 年）并计算到当前日期（时间戳）为止的天数。将此天数乘以 86400，然后加上自午夜以来的秒数。在您到达 2135 年之前，这将为您提供小于 2^32 的值。

请注意，您必须在时间戳前缀的十六进制编码形式中保留前导零，以便字母数字排序保留时间顺序。

在时间戳中增加几位，您可以增加时间范围 and/or 精度。如果再增加 8 位（两个十六进制字符），您最多可以使用 270 年，精确到百分之一秒。
请注意，您不必对以 10 为基数的秒数建模。通过将相同数量的字符分解为 128 分而不是 100 分，您将获得最佳的位使用率。随着年份范围的翻倍，这仍然适合 8 位（2 个十六进制字符）

时间精度（即每秒或每 100 秒或 128 秒）内的碰撞概率由 uuid 的范围驱动，因此对于所选精度，它将是 2^128 中的 1。提高时间戳的精度对减少碰撞机会的影响最大。它也是对密钥总大小影响最小的因素。

更高效的字符编码：27 到 29 个字符键

您可以通过将密钥编码为 64 位而不是 16 位来显着减小密钥的大小，这会给您 27 到 29 个字符（取决于您选择的精度）

请注意，对于时间戳部分，您需要使用一个编码函数，该函数将整数作为输入并保留数字字符的整理顺序。

例如：

def encode64(number, size):
    chars  = "+-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
    result = list()
    for _ in range(size):
        result.append(chars[number%64])
        number //= 64
    return "".join(reversed(result))

a = encode64(1234567890,6) # '-7ZU9G'
b = encode64(9876543210,6) # '7Ag-Pe'
print(a < b)               # True

u = encode64(int(uuid.uuid4()),22) # '1QA2LtMg30ztnugxaokVMk'

key = a+u  # '-7ZU9G1QA2LtMg30ztnugxaokVMk'  (28 characters)

您可以在编码之前将时间戳和 uuid 组合成一个数字，而不是连接两个编码值，从而节省更多字符。

encode64() 函数每 6 位需要一个字符。

所以，135 年精确到秒：(32+128)/6 = 26.7 --> 27 个字符

而不是 (32/6 = 5.3 --> 6) + (128/6 = 21.3 --> 22) ==> 28 个字符

uid       = uuid.uuid4()
timeStamp = daysSince2000 * 86400 + int(secondsSinceMidnight)
key       = encode64( timeStamp<<128 | int(uid) ,27)

270 年跨度和 128 秒精度：(40+128)/6 = 28 个字符

uid       = uuid.uuid4()
timeStamp = daysSince2000 * 86400 + int(secondsSinceMidnight)
precision = 128
timeStamp = timeStamp * precision + int(factionOfSecond * precision)
key       = encode64( timeStamp<<128 | int(uid) ,28)

使用 29 个字符，您可以将精度提高到秒的 1024th，年份范围提高到 2160 年.

UUID 掩码：17 到 19 个字符键

为了更加高效，您可以去掉 uuid 的前 64 位（这已经是一个时间戳）并将其与您自己的时间戳结合起来。这将为您提供长度为 17 到 19 个字符 的密钥，并且几乎不会损失避免碰撞（取决于您选择的精度）。

mask  = (1<<64)-1
key   = encode64( timeStamp<<64 | (int(uid) & mask) ,19)

Integer/Numeric 键 ?

最后一点，如果您的数据库支持非常大的整数或数字字段（140 位或更多）作为键，您不必将组合数字转换为字符串。直接用它作为key就行了。 timeStamp<<128 | int(uid)的数字顺序将遵循时间顺序。

如何在Python中生成一个按时间排序的uid？

How to generate a time-ordered uid in Python?

python

uuid

python-3.x

amazon-dynamodb

为什么 uuid.uuid1 不是连续的

重用的解决方案`uuid.uuid1`

多进程问题

减小 id 的大小

碰撞几率

如何在Python中生成一个按时间排序的uid？

How to generate a time-ordered uid in Python?

python

uuid

python-3.x

amazon-dynamodb

为什么 uuid.uuid1 不是连续的

重用的解决方案uuid.uuid1

多进程问题

减小 id 的大小

碰撞几率

重用的解决方案`uuid.uuid1`