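How do I convert a Windows-1251 text to something readable?
I have a string that is returned by the Jericho HTML parser and contains some Russian text. According to source.getEncoding() and the header of the respective HTML file, the encoding is Windows-1251.
How can I convert this string to something readable?
I tried this: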
import java.io.UnsupportedEncodingException;

public class Program {
    public void run() throws UnsupportedEncodingException {
        final String windows1251String = getWindows1251String();
        System.out.println("String (Windows-1251): " + windows1251String);
        final String readableString = convertString(windows1251String);
        System.out.println("String (converted): " + readableString);
    }

    private String convertString(String windows1251String) throws UnsupportedEncodingException {
        return new String(windows1251String.getBytes(), "UTF-8");
    }

    private String getWindows1251String() {
        final byte[] bytes = new byte[] {32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
        return new String(bytes);
    }

    public static void main(final String[] args) throws UnsupportedEncodingException {
        final Program program = new Program();
        program.run();
    }
}
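The variable bytes contains the data shown in my debugger; it is the result of net.htmlparser.jericho.Element.getContent().toString().getBytes(). I just copied and pasted that array here.
This does not work: readableString contains garbage.
How can I fix it, i.e. make sure that the Windows-1251 string is decoded properly?
Update 1 (30.07.2015 12:45 MSK): When I change the encoding in the call in convertString to Windows-1251, nothing changes. See the screenshot below.
Update 2: Another attempt:
Update 3 (30.07.2015 14:38): The text I need to decode corresponds to the text in the drop-down list shown below.
Update 4 (30.07.2015 14:41): The encoding detector (code below) says that the encoding is not Windows-1251 but UTF-8.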
public static String guessEncoding(byte[] bytes) {
    String DEFAULT_ENCODING = "UTF-8";
    org.mozilla.universalchardet.UniversalDetector detector =
        new org.mozilla.universalchardet.UniversalDetector(null);
    detector.handleData(bytes, 0, bytes.length);
    detector.dataEnd();
    String encoding = detector.getDetectedCharset();
    System.out.println("Detected encoding: " + encoding);
    detector.reset();
    if (encoding == null) {
        encoding = DEFAULT_ENCODING;
    }
    return encoding;
}
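A detector's guess can be cross-checked with the JDK alone by decoding strictly and seeing whether the decoder complains. The sketch below is only an illustration: the class and method names are made up for this example, and it assumes the JRE ships the windows-1251 charset. Keep in mind that a single-byte charset such as Windows-1251 accepts nearly every byte sequence, so a clean decode there proves very little, whereas a clean strict UTF-8 decode of non-ASCII bytes is a fairly strong hint.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictDecodeCheck {
    // Returns true if the bytes decode in the given charset without any
    // malformed or unmappable sequence. REPORT makes the decoder throw
    // instead of silently substituting a replacement character.
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The repeating triple from the question, padded with spaces.
        byte[] fromQuestion = {32, -17, -65, -67, 32};
        System.out.println("valid UTF-8: "
                + decodesCleanly(fromQuestion, StandardCharsets.UTF_8));
    }
}
```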
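(In light of the updates I deleted my original answer and started again.)
The text that appears
пїЅпїЅпїЅпїЅпїЅпїЅ
is an accurate Windows-1251 decoding of these byte values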
-17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67
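(padded at either end with 32, which is a space).
So either
1) the text is garbage, or
2) the text is supposed to look like that, or
3) the encoding is not Windows-1251.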
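One way to tell these cases apart is to decode the repeating triple under both candidate charsets. The sketch below is illustrative only (the class name is invented here, and it assumes the JRE provides windows-1251). The triple 0xEF 0xBF 0xBD is the UTF-8 encoding of U+FFFD, the Unicode replacement character, which would suggest that the original Cyrillic letters had already been replaced before getBytes() was called. That reading is an inference from the byte values, not something the question confirms.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ReplacementCharDemo {
    // The three bytes that repeat in the question's array (0xEF 0xBF 0xBD).
    static final byte[] SUSPECT = {-17, -65, -67};

    static String decode(Charset charset) {
        return new String(SUSPECT, charset);
    }

    public static void main(String[] args) {
        String asUtf8 = decode(StandardCharsets.UTF_8);
        String asCp1251 = decode(Charset.forName("windows-1251"));

        // As UTF-8 the three bytes form a single char: U+FFFD.
        System.out.printf("UTF-8: %d char(s), U+%04X%n",
                asUtf8.length(), (int) asUtf8.charAt(0));

        // As Windows-1251 each byte becomes its own Cyrillic letter.
        System.out.println("windows-1251: " + asCp1251.length() + " chars");
    }
}
```

If the data really is a run of U+FFFD, no later re-decoding can bring the letters back; the decoding has to be fixed where the bytes are first read.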
This line is demonstrably wrong:
return new String(windows1251String.getBytes(), "UTF-8");
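Extracting the bytes from a string and constructing a new string from them is not a way of "converting" between encodings. Both the input string and the output string use UTF-16 internally (and you normally don't even need to know or care about that). Other encodings only come into play when text data is stored outside of a String object, i.e. in your initial byte array. The conversion happens when the String is constructed, and then it is done. There is no conversion from one String type to another; they are all the same.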
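To make that concrete, here is a minimal sketch of the two places where a charset legitimately appears: once when raw bytes become a String, and once when a String becomes bytes again. The class name and the sample bytes (the word "Привет" written out as Windows-1251) are made up for illustration, and the code assumes the JRE supports windows-1251.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeOnceDemo {
    // Assumption: this charset is available in the running JRE.
    static final Charset CP1251 = Charset.forName("windows-1251");

    // Hypothetical sample: a six-letter Russian word as Windows-1251 bytes.
    static final byte[] SAMPLE = {
        (byte) 0xCF, (byte) 0xF0, (byte) 0xE8, (byte) 0xE2, (byte) 0xE5, (byte) 0xF2
    };

    // Decoding: raw bytes plus the charset they were written in.
    static String decode(byte[] raw) {
        return new String(raw, CP1251);
    }

    // Encoding: a String plus the charset the bytes should end up in.
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = decode(SAMPLE);
        byte[] utf8 = toUtf8(text);
        System.out.println(text.length() + " chars, " + utf8.length + " UTF-8 bytes");
    }
}
```

The String in the middle has no Windows-1251 or UTF-8 flavour of its own; only the byte arrays on either side do.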
The fact that this
return new String(bytes);
is the same as this
return new String(bytes, "Windows-1251");
suggests that Windows-1251 is the platform's default encoding. (Your timezone being MSK further supports this.)
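The platform default can be checked directly instead of inferred. The following sketch (class and method names invented for this example) prints the default charset and shows that the charset-less constructor behaves like passing that default explicitly. As far as I know, JDK 18 and later default to UTF-8 unless overridden, so the result depends on both the JDK version and the machine.

```java
import java.nio.charset.Charset;

class DefaultCharsetDemo {
    // new String(bytes) silently uses the platform default charset.
    static String decodeWithDefault(byte[] bytes) {
        return new String(bytes);
    }

    // The same decoding with the default spelled out.
    static String decodeExplicit(byte[] bytes) {
        return new String(bytes, Charset.defaultCharset());
    }

    public static void main(String[] args) {
        byte[] sample = {-17, -65, -67};
        System.out.println("Default charset: " + Charset.defaultCharset());
        System.out.println("Same result: "
                + decodeWithDefault(sample).equals(decodeExplicit(sample)));
    }
}
```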
I solved the problem by modifying the code that reads the text from the website.
private String readContent(final String urlAsString) {
    final StringBuilder content = new StringBuilder();
    BufferedReader reader = null;
    InputStream inputStream = null;
    try {
        final URL url = new URL(urlAsString);
        inputStream = url.openStream();
        // No charset given, so the platform default is used for decoding.
        reader =
            new BufferedReader(new InputStreamReader(inputStream));
        String inputLine;
        while ((inputLine = reader.readLine()) != null) {
            content.append(inputLine);
        }
    } catch (final IOException exception) {
        exception.printStackTrace();
    } finally {
        IOUtils.closeQuietly(reader);
        IOUtils.closeQuietly(inputStream);
    }
    return content.toString();
}
I changed the line
new BufferedReader(new InputStreamReader(inputStream));
to
new BufferedReader(new InputStreamReader(inputStream, "Windows-1251"));
and then it worked.
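The same fix can be written with try-with-resources and a Charset object, which avoids both the manual closing and the string-typed charset name. This is only a sketch under assumptions: the class and method names are hypothetical, the stream in main is an in-memory stand-in for url.openStream(), and unlike the original it keeps line breaks by appending '\n'.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

class ExplicitCharsetReader {
    // Reads the whole stream, decoding it with the charset the caller names.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder content = new StringBuilder();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                content.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return content.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the page body: six Cyrillic letters in Windows-1251.
        byte[] page = {
            (byte) 0xCF, (byte) 0xF0, (byte) 0xE8, (byte) 0xE2, (byte) 0xE5, (byte) 0xF2
        };
        String text = readAll(new ByteArrayInputStream(page),
                              Charset.forName("windows-1251"));
        System.out.println(text.trim().length() + " characters decoded");
    }
}
```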
Just to make sure you understand 100% how Java deals with char and byte.
byte[] input = new byte[1];
// Java bytes are signed, so values above 127 wrap around to negative numbers.
input[0] = (byte) 239; // the array now holds -17
// All 256 bit patterns are preserved, though.
// To get the unsigned value back as an int, mask with 0xFF
// (a plain cast is not enough).
int output = input[0] & 0xFF; // 239 again
// Do not cast a byte directly to a char: char is 16 bits wide, and the sign
// extension of a negative byte fills the upper byte with 1-bits.
char corrupted1 = (char) input[0]; // char code 65519 (0xFFEF)
char corrupted2 = (char) ((int) input[0]); // also 65519
// A plain cast is fine for values up to 0x7F: they stay positive as bytes,
// and ASCII-compatible encodings (e.g. Windows-1251) agree on those
// first 128 code points.
byte simple = (byte) 'a';
char chr1 = (char) simple; // 'a' again
// Masking with 0xFF is still more robust: the result can never exceed
// char code 255, even when the byte is unexpectedly negative (above 0x7F).
char chr2 = (char) (simple & 0xFF); // also 'a'
// For 239 (0xEF) even that is not enough.
// A Java char is a 16-bit UTF-16 code unit, following the Unicode character set.
// 0x00 to 0x7F mean the same thing in most encodings, but 0xEF in Windows-1251
// is 'п', while U+00EF in Unicode is 'ï'. So this is a bad idea:
char corrupted3 = (char) (input[0] & 0xFF);
// Only a charset can fix that, so it is good practice to ALWAYS name one.
// When decoding, the charset states what the bytes are encoded in NOW;
// they are converted to 16-bit chars. ("from-encoding" is a placeholder.)
String text = new String(input, "from-encoding");
// When turning the text back into bytes, the charset is the TARGET encoding.
// ("to-encoding" is a placeholder as well.)
byte[] encoded = text.getBytes("to-encoding");
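Hope this helps.
As for the displayed values:
I can confirm that the byte[] is displayed correctly. I checked the values against the Windows-1251 code page (byte -17 = int 239 = 0xEF = char 'п').
In other words, either your byte values are incorrect, or the source encoding is a different one.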