Any concerns about executing URLDecoder against a URL that was not encoded?

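I'm currently incorporating URLEncoder and URLDecoder into some code. Many URLs have already been saved, and these will be processed by the URLDecoder routine even though they were never processed by the URLEncoder routine in the first place.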

Based on some testing, there don't appear to be any issues, but I haven't tested every scenario.

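I did notice that some characters that would normally be encoded, such as /, are handled fine by the decode routine even though they were never encoded.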

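This led me to an admittedly oversimplified analysis. It appears the URLDecoder routine essentially scans the URL for a % and the next 2 bytes (provided UTF-8 is used). As long as the previously saved URLs contain no %, there should be no problem when the URLDecoder routine processes them. Does that sound right?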

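Yes, it will work for the "simple" cases, but if you call URLDecoder.decode on an unencoded URL that contains certain special characters, you may run into a) exceptions or b) unexpected behavior.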

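Consider the following example. For the third test it throws java.lang.IllegalArgumentException: URLDecoder: Incomplete trailing escape (%) pattern, and for the second test it silently alters the URL without any exception (whereas the regular encode/decode round trip works without problems):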

import java.net.URLDecoder;
import java.net.URLEncoder;

public class Test {
    public static void main(String[] args) throws Exception {
        test("http://www.foo.bar/");
        test("http://www.foo.bar/?q=a+b");
        test("http://www.foo.bar/?q=äöüß%"); // Will throw exception
    }

    private static void test(String url) throws Exception {
        String encoded = URLEncoder.encode(url, "UTF-8");
        String decoded = URLDecoder.decode(encoded, "UTF-8");
        System.out.println("encoded: " + encoded);
        System.out.println("decoded: " + decoded);
        System.out.println(URLDecoder.decode(decoded, "UTF-8"));
    }
}

Output (note how the + sign disappears):

encoded: http%3A%2F%2Fwww.foo.bar%2F
decoded: http://www.foo.bar/
http://www.foo.bar/
encoded: http%3A%2F%2Fwww.foo.bar%2F%3Fq%3Da%2Bb
decoded: http://www.foo.bar/?q=a+b
http://www.foo.bar/?q=a b
encoded: http%3A%2F%2Fwww.foo.bar%2F%3Fq%3D%C3%A4%C3%B6%C3%BC%C3%9F%25
decoded: http://www.foo.bar/?q=äöüß%
Exception in thread "main" java.lang.IllegalArgumentException: URLDecoder: Incomplete trailing escape (%) pattern
    at java.net.URLDecoder.decode(Unknown Source)
    at Test.test(Test.java:16)

See also the javadoc of URLDecoder, which describes both cases:

  • The plus sign "+" is converted into a space character " " .
  • A sequence of the form "%xy" will be treated as representing a byte where xy is the two-digit hexadecimal representation of the 8 bits. Then, all substrings that contain one or more of these byte sequences consecutively will be replaced by the character(s) whose encoding would result in those consecutive bytes. The encoding scheme used to decode these characters may be specified, or if unspecified, the default encoding of the platform will be used.
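The two rules quoted above can be observed in isolation with a small sketch. The class name DecodeRulesDemo and the helper decode are hypothetical and exist only for illustration; the helper wraps URLDecoder.decode with UTF-8 and returns a marker string when the input is rejected.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class DecodeRulesDemo {
    // Hypothetical helper (not part of the JDK): decodes with UTF-8 and
    // returns a marker string instead of propagating the exception.
    static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (IllegalArgumentException e) {
            // Thrown for malformed escapes such as a trailing '%'.
            return "IllegalArgumentException";
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every Java platform must support.
            throw new AssertionError(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(decode("a/b"));    // no '+' or '%': returned unchanged
        System.out.println(decode("a+b"));    // '+' is converted into a space
        System.out.println(decode("%C3%A4")); // two escaped bytes form one UTF-8 character
        System.out.println(decode("100%"));   // incomplete escape: rejected
    }
}
```

Only + and % trigger any rewriting, which is why a character like / passes through untouched.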

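If you are certain that the unencoded URLs contain neither + nor %, then I think it is safe to call URLDecoder.decode. Otherwise I would suggest implementing additional checks, e.g. trying to decode and comparing the result with the original (see this question on SO).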
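One possible shape for such a check is sketched below. This is a heuristic and an assumption of mine, not the method from the linked question: the hypothetical helper decodeIfEncoded treats a string as encoded only if decoding succeeds and re-encoding the result with URLEncoder reproduces the input exactly.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class SafeDecode {
    // Hypothetical helper (not part of the JDK). Returns the decoded form only
    // when the input looks like genuine URLEncoder output; otherwise returns
    // the input unchanged.
    static String decodeIfEncoded(String url) {
        // Without '+' or '%' the decoder would not change anything.
        if (url.indexOf('+') < 0 && url.indexOf('%') < 0) {
            return url;
        }
        try {
            String decoded = URLDecoder.decode(url, "UTF-8");
            // Round-trip check: real URLEncoder output re-encodes to itself.
            String reencoded = URLEncoder.encode(decoded, "UTF-8");
            return reencoded.equals(url) ? decoded : url;
        } catch (IllegalArgumentException e) {
            // Malformed escape such as a trailing '%': treat as unencoded.
            return url;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every Java platform must support.
            throw new AssertionError(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(decodeIfEncoded("http%3A%2F%2Fwww.foo.bar%2F%3Fq%3Da%2Bb"));
        System.out.println(decodeIfEncoded("http://www.foo.bar/?q=a+b"));
    }
}
```

The heuristic has limits. A string such as a+b is valid both as raw text and as URLEncoder output for "a b", so it cannot be classified reliably, and input encoded by another tool (for example with lowercase hex digits) fails the round trip and is left as-is.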