Converting char from CP437 encoding to UTF-8 encoding always yields the same character code, thus not the same character
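The problem
I am trying to convert chars and/or byte arrays from CP437 encoding to UTF-8 (Encoding.UTF8). The problem is that no matter what I try, the code always yields the same character code, but since the two encodings map different character sets to those character codes, the resulting char is not the same.
For example, I am trying to convert the char with character code 3 from CP437 (a heart: ♥) to UTF-8, and I still want it to be the same character. However, when converted to UTF-8 it still uses character code 3, which results in a control character called ETX (see UTF-8's codepage layout for a list of characters).
My attempts
Here are some of my attempts:
(Common code)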
Public Shared ReadOnly CP437 As Encoding = Encoding.GetEncoding("IBM437")
Public Shared ReadOnly BytesToConvert As Byte() = New Byte(3 - 1) {3, 4, 5} 'Characters: ♥, ♦, ♣.
Public Sub DebugEncodedArray(ByVal Bytes As Byte(), ByVal Encoding As Encoding)
    Dim ResultingString As String = Encoding.GetString(Bytes)
    MessageBox.Show( _
        String.Format("Encoding: {1}{0}" & _
                      "String: ""{2}""{0}" & _
                      "Bytes: {{{3}}}{0}", _
                      Environment.NewLine, _
                      Encoding.EncodingName, _
                      ResultingString, _
                      String.Join(", ", Bytes)), _
        "Debug", MessageBoxButtons.OK, MessageBoxIcon.Information _
    )
End Sub
Dim ConvertedBytes As Byte() = Encoding.Convert(CP437, Encoding.UTF8, BytesToConvert)
DebugEncodedArray(ConvertedBytes, Encoding.UTF8)
Using a StreamWriter with a specific encoding, writing to a MemoryStream:
Using MStream As New MemoryStream(16)
    Using Writer As New StreamWriter(MStream, CP437)
        Writer.Write(CP437.GetChars(BytesToConvert))
    End Using
    Dim UTF8Bytes As Byte() = Encoding.Convert(CP437, Encoding.UTF8, MStream.ToArray())
    DebugEncodedArray(UTF8Bytes, Encoding.UTF8)
End Using
Writing to a file, then reading it and converting the bytes (not optimal for what I need this code for):
File.WriteAllText("C:\Users\Vincent\Desktop\test.txt", CP437.GetString(BytesToConvert), CP437)
Dim FileBytes As Byte() = File.ReadAllBytes("C:\Users\Vincent\Desktop\test.txt")
Dim UTF8Bytes As Byte() = Encoding.Convert(CP437, Encoding.UTF8, FileBytes)
DebugEncodedArray(UTF8Bytes, Encoding.UTF8)
The result
All of the attempts above give the same result:
If I pass CP437 to DebugEncodedArray() instead of Encoding.UTF8:
Expected result
The result I am expecting is:
Dim UTF8Bytes As Byte() = Encoding.UTF8.GetBytes("♥♦♣")
DebugEncodedArray(UTF8Bytes, Encoding.UTF8)
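Any clues about what I am doing wrong?
The low range of CP437 is context dependent. I think you have shown that for 1-31 and 127 you will need a simple lookup, because .NET interprets those bytes in a control-code context rather than a graphical one, i.e. ◙ (0xA) is \n, not the equivalent Unicode code point for that graphic.
(For future readers) This is how I ended up solving my problem, thanks to Alex K.'s suggestion: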
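To make the "simple lookup for 1-31 and 127" idea concrete, here is a minimal sketch. It is not the code from this thread: the module name CP437Demo, the helper DecodeCP437Graphical and the LowGlyphs constant are hypothetical names chosen for illustration. It assumes Encoding.GetEncoding("IBM437") is available (it is built in on .NET Framework; newer runtimes may need the code pages encoding provider registered first) and only overrides the bytes that .NET decodes as control codes.

```vbnet
Imports System
Imports System.Text

Module CP437Demo
    ' Graphical glyphs that CP437 displays for bytes 1 to 31, in byte order.
    Private Const LowGlyphs As String = "☺☻♥♦♣♠•◘○◙♂♀♪♫☼►◄↕‼¶§▬↨↑↓→←∟↔▲▼"

    ' Hypothetical helper (an assumption, not an existing .NET API):
    ' decodes CP437 bytes, treating 1-31 and 127 as glyphs instead of control codes.
    Function DecodeCP437Graphical(ByVal Bytes As Byte()) As String
        ' Assumes the IBM437 code page is available on this runtime.
        Dim CP437 As Encoding = Encoding.GetEncoding("IBM437")
        Dim Builder As New StringBuilder(Bytes.Length)
        For Each B As Byte In Bytes
            If B >= 1 AndAlso B <= 31 Then
                ' Low range: take the glyph from the lookup string.
                Builder.Append(LowGlyphs(B - 1))
            ElseIf B = 127 Then
                ' 0x7F is drawn as the "house" glyph (U+2302).
                Builder.Append(ChrW(&H2302))
            Else
                ' Everything else is decoded by the built-in encoding.
                Builder.Append(CP437.GetString(New Byte() {B}))
            End If
        Next
        Return Builder.ToString()
    End Function

    Sub Main()
        Dim Text As String = DecodeCP437Graphical(New Byte() {3, 4, 5})
        Dim UTF8Bytes As Byte() = Encoding.UTF8.GetBytes(Text)
        ' Prints: 226, 153, 165, 226, 153, 166, 226, 153, 163
        Console.WriteLine(String.Join(", ", UTF8Bytes))
    End Sub
End Module
```

The key point is that the conversion goes through a String: once the bytes are decoded to the intended Unicode characters, Encoding.UTF8.GetBytes produces the multi-byte UTF-8 sequences for ♥, ♦ and ♣ rather than the single bytes 3, 4 and 5.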
Dim Heart As Char = Convert.ToChar(CP437LookupTable(3)) 'Results in: ♥. YAY!
The lookup table:
'Lookup table for Codepage 437-to-Unicode character codes.
Private Shared ReadOnly CP437LookupTable As Integer() = _
New Integer(256 - 1) { _
0, 9786, 9787, 9829, 9830, 9827, 9824, _
8226, 9688, 9675, 9689, 9794, 9792, 9834, 9835, _
9788, 9658, 9668, 8597, 8252, 182, 167, 9644, _
8616, 8593, 8595, 8594, 8592, 8735, 8596, 9650, _
9660, 32, 33, 34, 35, 36, 37, 38, _
39, 40, 41, 42, 43, 44, 45, 46, _
47, 48, 49, 50, 51, 52, 53, 54, _
55, 56, 57, 58, 59, 60, 61, 62, _
63, 64, 65, 66, 67, 68, 69, 70, _
71, 72, 73, 74, 75, 76, 77, 78, _
79, 80, 81, 82, 83, 84, 85, 86, _
87, 88, 89, 90, 91, 92, 93, 94, _
95, 96, 97, 98, 99, 100, 101, 102, _
103, 104, 105, 106, 107, 108, 109, 110, _
111, 112, 113, 114, 115, 116, 117, 118, _
119, 120, 121, 122, 123, 124, 125, 126, _
8962, 199, 252, 233, 226, 228, 224, 229, _
231, 234, 235, 232, 239, 238, 236, 196, _
197, 201, 230, 198, 244, 246, 242, 251, _
249, 255, 214, 220, 162, 163, 165, 8359, _
402, 225, 237, 243, 250, 241, 209, 170, _
186, 191, 8976, 172, 189, 188, 161, 171, _
187, 9617, 9618, 9619, 9474, 9508, 9569, 9570, _
9558, 9557, 9571, 9553, 9559, 9565, 9564, 9563, _
9488, 9492, 9524, 9516, 9500, 9472, 9532, 9566, _
9567, 9562, 9556, 9577, 9574, 9568, 9552, 9580, _
9575, 9576, 9572, 9573, 9561, 9560, 9554, 9555, _
9579, 9578, 9496, 9484, 9608, 9604, 9612, 9616, _
9600, 945, 223, 915, 960, 931, 963, 181, _
964, 934, 920, 937, 948, 8734, 966, 949, _
8745, 8801, 177, 8805, 8804, 8992, 8993, 247, _
8776, 176, 8729, 183, 8730, 8319, 178, 9632, _
160 _
}