What/Where is the formal specification of BaseN (e.g., Base64) encodings?
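TL;DR: Is there a formal specification for these encodings (e.g., an ISO or other national/international standard), or are they mostly a general technique that developers apply as they see fit?
I started going down this rabbit hole when I came across this passage (from this PhD thesis):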
That is, interpreting c as a base-256 encoding of some number, with digits from least significant to most significant (i.e., a little-endian number), we print the number in base-32 with digits from most significant to least significant. (Note that the outer summation denotes string concatenation, while the inner summation denotes integer addition.) The set of digits is
digits32 = "0123456789abcdfghijklmnpqrsvwxyz"
i.e., the alphanumerics excepting the letters e, o, u, and t. This is to reduce the possibility that hash representations contain character sequences that are potentially offensive to some users (a known possibility with alphanumeric representations of numbers [11]).
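To make the quoted description concrete, here is a minimal Python sketch that follows only the prose above: read the bytes as a little-endian number, then print it in base 32, most significant digit first, using the quoted alphabet. The function name `thesis_base32` and the choice to pad the output to ceil(8n/5) digits are assumptions made for illustration; the thesis's actual formula is not reproduced in the quote.

```python
DIGITS32 = "0123456789abcdfghijklmnpqrsvwxyz"  # alphabet quoted from the thesis

def thesis_base32(c: bytes) -> str:
    """Print the little-endian number stored in c in base 32, most significant digit first."""
    number = int.from_bytes(c, "little")   # base-256 digits, least significant first
    length = (len(c) * 8 + 4) // 5         # assumption: pad to ceil(8n/5) digits
    # Extract 5-bit digits from the most significant position down to the least.
    return "".join(DIGITS32[(number >> (5 * i)) & 0x1F] for i in reversed(range(length)))

print(thesis_base32(b"\xff"))      # 7z
print(thesis_base32(b"\x00\x01"))  # 0080
```

Note that this is a positional number conversion, which is not the same thing as RFC 4648 Base32: that one processes the input in 5-byte groups from the front and uses a different alphabet.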
I don't have much experience with the subject, so I started learning the basics in the following order:
- Base64 wikipedia entry
- Binary-to-text encoding
- Base What? A Practical Introduction to Base Encoding;
- RFC 4648 - The Base16, Base32, and Base64 Data Encodings
None of them mentions Base256, but so far this is how I would summarize what BaseN encodings are (in a very oversimplified and sloppy way):
Encoding schemes to represent binary data in
textual format based on a set of characters
(e.g., chosen arbitrarily by developer, defined
by a standard/specification), where the size of
the set forms the base of the encoding scheme
(e.g., Base64 - 64 characters).
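I chose the word "arbitrarily" because RFC4648's Base32 definition differs from the Base32 used in the paper (at the very least, the character sets clearly differ).
As for Base256, the paper doesn't mention it again, and when searching for "Base256", "base-256", "base 256", etc., I only found implementations, not any formal specification. These also seem to be similar in name only (another reason I used the word "arbitrarily" above):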
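base256-encoding: "Base256 encoding, a.k.a. latin1 encoding, the most memory-efficient encoding possible in JavaScript."
I couldn't find much information about "latin1 Base256 encoding", but I presume the Base256 implementation in that project uses the Latin1 character set as its basis.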
base-256: "Encode and decode base256 encoding as gnu-tar does (supported range is -9007199254740991 to 9007199254740991)."
Looking up the GNU tar manual's "GNU Extensions to the Archive Format" section, the relevant paragraph states (emphasis mine):
For fields containing numbers or timestamps that are out of range for the basic format, the GNU format uses a base-256 representation instead of an ASCII octal number. If the leading byte is 0xff (255), all the bytes of the field (including the leading byte) are concatenated in big-endian order, with the result being a negative number expressed in two’s complement form. If the leading byte is 0x80 (128), the non-leading bytes of the field are concatenated in big-endian order, with the result being a positive number expressed in binary form. Leading bytes other than 0xff, 0x80 and ASCII octal digits are reserved for future use, as are base-256 representations of values that would be in range for the basic format.
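The rules in that paragraph can be sketched in a few lines of Python. This is only an illustration of the quoted text, not GNU tar's own code: the function name `decode_tar_number` is hypothetical, and the fallback for plain ASCII octal fields (stripping NUL and space padding) is a simplifying assumption about the basic format.

```python
def decode_tar_number(field: bytes) -> int:
    """Decode a numeric tar header field as described in the quoted manual text."""
    if not field:
        raise ValueError("empty field")
    lead = field[0]
    if lead == 0xFF:
        # Negative: all bytes, including the leading one, as big-endian two's complement.
        return int.from_bytes(field, "big", signed=True)
    if lead == 0x80:
        # Positive: the non-leading bytes as a big-endian binary number.
        return int.from_bytes(field[1:], "big")
    # Assumption: anything else is treated as NUL/space-padded ASCII octal.
    # Reserved leading bytes end up raising ValueError here.
    text = field.strip(b" \x00").decode("ascii")
    return int(text, 8) if text else 0

big_value = decode_tar_number(b"\x80" + (2**33).to_bytes(11, "big"))
print(big_value == 2**33)                      # True
print(decode_tar_number(b"\xff" * 12))         # -1
print(decode_tar_number(b"0000644\x00"))       # 420
```

Here "base-256" simply means a binary integer stored one byte per digit, which has nothing to do with a printable alphabet.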
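When looking for a formal specification, what you usually want to look for is an RFC or an ISO or IEEE standard. The specification for the Base-N encodings is RFC4648.
That said, base-256 encoding serves a completely different purpose from the base-N encodings you linked.
Base-16 through base-64 are designed for encoding binary data when only a limited character set is available. Quoting RFC4648: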
Base encoding of data is used in many situations to store or transfer
data in environments that, perhaps for legacy reasons, are restricted
to US-ASCII [1] data. Base encoding can also be used in new
applications that do not have legacy restrictions, simply because it
makes it possible to manipulate objects with text editors.
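The RFC doesn't describe any base-N encodings beyond these, because for practical purposes a larger alphabet gains very little. We might be able to squeeze in more data by using every character a given environment allows, but we would lose a lot of portability and risk breaking our code after an update.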
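As a quick illustration, Python's standard `base64` module implements the Base16, Base32, and Base64 encodings from RFC 4648. The input `b"foobar"` is taken from the RFC's test vectors; the variable names are just for this sketch.

```python
import base64

data = b"foobar"  # sample input, also used in the RFC 4648 test vectors

hex_text = base64.b16encode(data)  # Base16: 16-character alphabet, 4 bits per character
b32_text = base64.b32encode(data)  # Base32: 32-character alphabet, 5 bits per character
b64_text = base64.b64encode(data)  # Base64: 64-character alphabet, 6 bits per character

print(hex_text)  # b'666F6F626172'
print(b32_text)  # b'MZXW6YTBOI======'
print(b64_text)  # b'Zm9vYmFy'

# Every output is plain US-ASCII and decodes back to the original bytes.
print(base64.b64decode(b64_text) == data)  # True
```

The larger the alphabet, the shorter the output, but all three stay within a small set of characters that survives almost any text-only channel.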
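Base-256 encoding, however, is typically used to store code points. A single byte can already hold 256 distinct values, so in a sense binary data is already stored in base-256.
Code points are what we usually think of as characters. For example, a Unicode character is a single code point. The problem we run into is that we can't just store code points as-is. We can generally fit any code point into 4 bytes, but since most languages don't need that much space per character, storing them that way is very inefficient. Typically, a base-256 encoding is a way of encoding a list of code points into as few bytes as possible.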
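UTF-8 is generally the most popular way to encode code points, because it offers a decent solution for any value and lets us quickly tell characters apart no matter where we start reading. Here is a rough summary from RFC3629: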
Char. number range | UTF-8 octet sequence
(hexadecimal) | (binary)
--------------------+---------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
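The table translates almost directly into code. Below is a minimal Python sketch of an encoder for a single code point; the function name `utf8_encode` is made up for this example, and in practice you would just call `str.encode("utf-8")`, which the last line uses as a cross-check.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point following the RFC 3629 table above."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")  # surrogates are excluded by RFC 3629
    if cp <= 0x7F:                       # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),     # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

print(utf8_encode(0x20AC))                                 # b'\xe2\x82\xac'
print(utf8_encode(0x20AC) == "\u20ac".encode("utf-8"))     # True
```

Continuation bytes always start with the bits `10`, which is what lets a reader resynchronize on a character boundary from any position in the stream.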