Ruby String encoding changed over versions

Question

I'm currently working on a legacy project that runs on Ruby 1.9.3, and I'm looking at migrating it to 2.3.0 over the next few months.

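We have this line of code, which returns different results on different Ruby versions. I'd like to know whether this is a Ruby bug that was fixed, a new bug, or a documented change in behavior. A reference to the relevant bug ticket would help.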

content = "Is your pl\xFFace available?".force_encoding("UTF-8")
content.encode("UTF-8", invalid: :replace) # some other details removed to give smallest code sample

Result on Ruby 1.9.3 and 2.0:

"Is your pl\xFFace available?"

Result on Ruby 2.1, 2.2, and 2.3:

"Is your pl�ace available?"

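Basically, "\xFF" is considered invalid enough to be replaced, but if invalid: :replace is omitted, no error is raised as it should be. I'm guessing that is because the call is a no-op, since the source and target encodings are the same.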

Answer 1

If you look at the diff between the 2.0 and 2.1 documentation, you will see that the following text disappeared in 2.1:

Please note that conversion from an encoding enc to the same encoding enc is a no-op, i.e. the receiver is returned without any changes, and no exceptions are raised, even if there are invalid bytes.

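So this appears to be an intentional change: 2.0 and earlier leave the string unmodified when the source and target encodings are the same, whereas 2.1+ replaces the invalid bytes when asked to.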
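The difference can be checked with a minimal sketch like the one below, assuming Ruby 2.1 or newer (on 1.9.3 and 2.0 both calls should hand back the bytes unchanged). The helper name same_encoding_results is made up for this illustration and is not part of the original code.

```ruby
# Assumes Ruby 2.1+. same_encoding_results is a hypothetical helper name.
def same_encoding_results(content)
  {
    # No invalid: option: the conversion is still a no-op and nothing is raised.
    plain: content.encode("UTF-8"),
    # Explicit invalid: :replace: invalid bytes are replaced with U+FFFD.
    replaced: content.encode("UTF-8", invalid: :replace)
  }
end

content = "Is your pl\xFFace available?".force_encoding("UTF-8")
results = same_encoding_results(content)

p results[:plain].valid_encoding?     # false: the \xFF byte is still there
p results[:replaced].valid_encoding?  # true on Ruby 2.1+
```

Note that the call without invalid: :replace still does not raise, so it cannot be used as a validity check; valid_encoding? is the reliable test.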

I'm not 100% sure what your code is trying to do, but if it is trying to clean invalid UTF-8 byte sequences out of a string, you can use valid_encoding? and scrub as of Ruby 2.1:

irb(main):055:0* content = "Is your pl\xFFace available?"
=> "Is your pl\xFFace available?"
irb(main):056:0> content.valid_encoding?
=> false
irb(main):057:0> new = content.scrub
=> "Is your pl�ace available?"
irb(main):059:0> new.valid_encoding?
=> true
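Since scrub only exists from 2.1 on, code that has to run on both sides of the migration could use something like the following sketch. The helper name scrub_utf8 is hypothetical, and the fallback assumes the string is already tagged as UTF-8; it round-trips through UTF-16BE so that a real transcoding step takes place and invalid: :replace is honored on older versions too.

```ruby
# scrub_utf8 is a hypothetical helper name, not part of Ruby or the original code.
# Assumes str is tagged as UTF-8 (for example via force_encoding("UTF-8")).
def scrub_utf8(str)
  return str if str.valid_encoding?

  if str.respond_to?(:scrub)
    # Ruby 2.1+: invalid bytes become U+FFFD.
    str.scrub
  else
    # Ruby 1.9.3 / 2.0: a conversion to a different encoding is not a no-op,
    # so invalid: :replace is applied; then convert back to UTF-8.
    str.encode("UTF-16BE", invalid: :replace).encode("UTF-8")
  end
end

content = "Is your pl\xFFace available?".force_encoding("UTF-8")
cleaned = scrub_utf8(content)
p cleaned.valid_encoding?
```

For strings with longer runs of broken bytes, the two branches are not guaranteed to insert the same number of replacement characters, so it is worth comparing them on real data before relying on the fallback.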

Edit:

If you look at the 2.0 source code, you will see that the str_transcode0 function exits immediately if senc (the source encoding) is the same as denc (the destination encoding):

    if (senc && senc == denc) {
        return NIL_P(arg2) ? -1 : dencidx;
    }

In 2.1 it scrubs the data when the encodings are the same and you explicitly asked for invalid sequences to be replaced:

    if (senc && senc == denc) {
        ...
        if ((ecflags & ECONV_INVALID_MASK) && explicitly_invalid_replace) {
            dest = rb_str_scrub(str, rep);
        }
        ...
    }
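Because the 2.1 branch passes a replacement (rep) on to rb_str_scrub, the replace: option should also take effect for a same-encoding call. A small sketch, assuming Ruby 2.1+; strip_invalid_bytes is a hypothetical helper name:

```ruby
# Assumes Ruby 2.1+. strip_invalid_bytes is a hypothetical helper name.
# Re-encodes a string to its own encoding, replacing invalid bytes with
# the given replacement (an empty string drops them entirely).
def strip_invalid_bytes(str, replacement = "")
  str.encode(str.encoding, invalid: :replace, replace: replacement)
end

content = "Is your pl\xFFace available?".force_encoding("UTF-8")
p strip_invalid_bytes(content)        # invalid byte removed
p strip_invalid_bytes(content, "?")   # invalid byte replaced with "?"
```

On 1.9.3 and 2.0 the same call would return the string untouched, since the early return shown above is hit before any replacement happens.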
