MonotonicallyIncreasingId output varies widely

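When I use this function, I sometimes get low return values such as 0, 1, 2, 3, ..., and at other times, when I use it with a DataFrame, I get very large values. It is not clear why this happens. I have read that the generated values cannot be controlled.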

From a comment in the source code:

The current implementation puts the partition ID in the upper 31 bits, and the lower 33 bits represent the record number within each partition. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records.

So you get low values for partition 0 and very large values for every other partition.
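To see which partition a given value came from, the ID can be split back into its two fields. This is a minimal pure-Python sketch of the layout described in the comment above; `split_id` is a hypothetical helper written for illustration, not part of Spark's API.

```python
from typing import Tuple

RECORD_BITS = 33  # lower 33 bits hold the record number, per the source comment

def split_id(generated_id: int) -> Tuple[int, int]:
    """Return (partition_id, record_number) for a generated ID."""
    partition_id = generated_id >> RECORD_BITS               # upper bits
    record_number = generated_id & ((1 << RECORD_BITS) - 1)  # lower 33 bits
    return partition_id, record_number

print(split_id(2))           # (0, 2)
print(split_id(8589934593))  # (1, 1)
```

Any ID from partition 1 or later is at least 2^33 = 8589934592, which is why the values jump so sharply.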

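But this is an implementation detail that you should not rely on. Only the monotonically increasing (and unique) property of the IDs is guaranteed to stay the same.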

This is made clear in the function doc for monotonicallyIncreasingId():

A column expression that generates monotonically increasing 64-bit integers. The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records.

As an example, consider a DataFrame with two partitions, each with 3 records. This expression would return the following IDs: 0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594.
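The example from the doc can be reproduced with plain arithmetic. The sketch below only mimics the documented bit layout and does not call Spark; `simulate_ids` and its `partition_sizes` parameter are illustrative names, and the real implementation may change.

```python
from typing import List

def simulate_ids(partition_sizes: List[int]) -> List[int]:
    """Mimic the documented layout: partition ID in the upper bits,
    record number within the partition in the lower 33 bits."""
    ids = []
    for partition_id, size in enumerate(partition_sizes):
        base = partition_id << 33  # first ID of this partition
        ids.extend(base + record for record in range(size))
    return ids

# Two partitions with 3 records each, as in the doc's example.
print(simulate_ids([3, 3]))
# [0, 1, 2, 8589934592, 8589934593, 8589934594]
```

The IDs are increasing and unique but not consecutive, which matches the guarantee stated in the doc.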

Hope this helps!