Extracting text from InputStreamReader not working in UTF-8
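I am trying to read the following API text page:
Using an InputStreamReader, I want to extract the text and print it line by line.
The problem is that the text is not recognized as UTF-8, so the output looks garbled: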
????
The code of the method is as follows:
String testURL = "https://api.stackexchange.com/2.2/users?page=1&pagesize=9&fromdate=1221436800&todate=1523318400&order=desc&min=1&max=2000000&sort=reputation&site=stackoverflow";
URL url = null;
try
{
url = new URL(testURL);
} catch (MalformedURLException e1)
{
e1.printStackTrace();
}
InputStream is = null;
try
{
is = url.openStream();
} catch (IOException e1)
{
e1.printStackTrace();
}
try (BufferedReader br = new BufferedReader(new InputStreamReader(is, "ISO-8859-1")))
{
String line;
while ((line = br.readLine()) != null)
{
System.out.println(line);
}
} catch (MalformedURLException e)
{
e.printStackTrace();
} catch (IOException e)
{
e.printStackTrace();
}
I tried changing the line
try (BufferedReader br = new BufferedReader(new InputStreamReader(is, "UTF-8")))
to
try (BufferedReader br = new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8)))
or to
try (BufferedReader br = new BufferedReader(new InputStreamReader(is, "ISO-8859-1")))
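Unfortunately, the problem persists. I would really appreciate any hints on how to solve this. Thanks.
To analyze your problem, I tried downloading from the given URL with curl (using the -i option to see the HTTP response header lines) and got: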
Cache-Control: private
Content-Type: application/json; charset=utf-8
Content-Encoding: gzip
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: GET, POST
Access-Control-Allow-Credentials: false
X-Content-Type-Options: nosniff
Date: Sat, 21 Apr 2018 21:48:42 GMT
Content-Length: 85
▒VJ-*▒/▒▒LQ▒210ЁrsS▒▒▒S▒▒▒▒3KR2▒▒R
K3▒RS▒`J▒sA▒I▒)▒▒E@NIj▒R-g▒▒PP^C
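The line Content-Encoding: gzip tells you that the content is gzip-compressed.
Therefore, in your Java program you need to gzip-decompress the content before decoding it as text. Changing the charset passed to InputStreamReader cannot help, because the bytes it receives are still compressed.
You can do this simply by replacing the line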
is = url.openStream();
with
is = new GZIPInputStream(url.openStream());
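To see the effect without touching the network, here is a minimal sketch that gzips a small UTF-8 string in memory and reads it back through GZIPInputStream. The class name GzipRoundTrip, its methods compress and decode, and the sample JSON text are made up for illustration; they are not part of your code or of the API.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipRoundTrip {

    // Gzip the UTF-8 bytes of a string, imitating a gzip-encoded response body.
    public static byte[] compress(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buffer.toByteArray();
    }

    // Decompress first, then decode the resulting bytes as UTF-8, line by line.
    public static String decode(byte[] gzipped) throws IOException {
        InputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped));
        StringBuilder text = new StringBuilder();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                if (text.length() > 0) {
                    text.append('\n');
                }
                text.append(line);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample payload; \u00f6 is a non-ASCII character (o with umlaut).
        byte[] gzipped = compress("{\"name\":\"J\u00f6rg\"}");
        // Decoding the compressed bytes directly gives unreadable characters,
        // no matter which charset is chosen.
        System.out.println(new String(gzipped, StandardCharsets.UTF_8));
        // Removing the gzip layer first recovers the original text.
        System.out.println(decode(gzipped));
    }
}
```

The first println shows the same kind of garbage you saw, and the second shows the readable JSON, which suggests the charset was never the problem.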
A better approach is to get the actual Content-Encoding and decide based on it whether the content needs to be decompressed:
URLConnection connection = url.openConnection();
is = connection.getInputStream();
// getContentEncoding() returns null when the header is absent, so compare null-safely
String contentEncoding = connection.getContentEncoding();
if ("gzip".equalsIgnoreCase(contentEncoding))
    is = new GZIPInputStream(is);
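Putting the pieces together, the whole read loop could look roughly like the sketch below. The class name ReadGzipApi and the helper wrapIfGzip are hypothetical names chosen for this example, and the URL is taken from the command line rather than hard-coded. It assumes the body is UTF-8 text, as the Content-Type header above indicates.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

public class ReadGzipApi {

    // Wrap the stream in a GZIPInputStream only when the header says gzip.
    // contentEncoding may be null if the server sent no Content-Encoding header.
    public static InputStream wrapIfGzip(InputStream raw, String contentEncoding)
            throws IOException {
        return "gzip".equalsIgnoreCase(contentEncoding) ? new GZIPInputStream(raw) : raw;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: java ReadGzipApi <url>");
            return;
        }
        URL url = new URL(args[0]);
        URLConnection connection = url.openConnection();
        InputStream body = wrapIfGzip(connection.getInputStream(),
                                      connection.getContentEncoding());
        // Decompression (if any) happens below the reader; the charset applies
        // to the decompressed bytes.
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(body, StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```

Keeping the gzip decision in one small method means the reading code stays the same whether or not the server compresses the response.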