ConvertTo-Json and ConvertFrom-Json with special characters

Question

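I have a file containing some properties, and some of their values contain escape characters, for example URLs and regex patterns.

When I read the content and convert it back to JSON, the result is incorrect whether or not I unescape it. If I convert back to JSON with unescaping, some regular expressions break; if I convert without unescaping, the URLs and some regular expressions break.

How can I solve this problem?

Minimal, complete, verifiable example

Here are some simple code blocks that let you reproduce the problem:

Content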

$fileContent = 
@"
{
    "something":  "http://domain/?x=1&y=2",
    "pattern":  "^(?!(\`|\~|\!|\@|\#|\$|\||\\|\'|\\")).*"
}
"@

Using Unescape

If I read the content and then convert it back to JSON using the following command:

$fileContent | ConvertFrom-Json | ConvertTo-Json | %{[regex]::Unescape($_)}

The output (which is wrong) will be:

{
    "something":  "http://domain/?x=1&y=2",
    "pattern":  "^(?!(\|\~|\!|\@|\#|$|\||\|\'|\")).*"
}

Without Unescape

If I read the content and then convert it back to JSON using the following command:

$fileContent | ConvertFrom-Json | ConvertTo-Json

The output (which is wrong) will be:

{
    "something":  "http://domain/?x=1\u0026y=2",
    "pattern":  "^(?!(\|\~|\!|\@|\#|\$|\||\\|\\u0027|\\")).*"
}

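Expected result

The expected result should be identical to the content of the input file.

Answer 1

I decided not to use Unescape and instead replace the Unicode \uxxxx escape sequences with the characters they represent, and now it works as expected: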

$fileContent = 
@"
{
    "something":  "http://domain/?x=1&y=2",
    "pattern":  "^(?!(\`|\~|\!|\@|\#|\$|\||\\|\'|\\")).*"
}
"@

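$fileContent | ConvertFrom-Json | ConvertTo-Json | %{
    # Replace every \uXXXX escape sequence with the character it represents.
    # The regex needs an escaped backslash (\\u) to match a literal \u.
    [Regex]::Replace($_, 
        "\\u(?<Value>[a-zA-Z0-9]{4})", {
            param($m) ([char]([int]::Parse($m.Groups['Value'].Value,
                [System.Globalization.NumberStyles]::HexNumber))).ToString() } )}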

Which produces the expected output:

{
    "something":  "http://domain/?x=1&y=2",
    "pattern":  "^(?!(\|\~|\!|\@|\#|\$|\||\\|\'|\\")).*"
}

Answer 2

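If you don't want to rely on regex (as in @Reza Aghaei's answer), you can import the Newtonsoft JSON library. The benefit is its default StringEscapeHandling property, which escapes control characters only. Another benefit is that you avoid the potentially dangerous string replacements you would be doing with regex.

This StringEscapeHandling is also the default handling in PowerShell Core (version 6 and later), because it started using Newtonsoft internally at that point. So another alternative is to use ConvertFrom-Json and ConvertTo-Json from PowerShell Core.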
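As a rough illustration of the PowerShell Core route, the sketch below compares the default escaping with the opt-in HTML-style escaping. It assumes PowerShell 6.2 or later, where ConvertTo-Json has an -EscapeHandling parameter (it does not exist in Windows PowerShell); the $data object is a made-up sample, not the file from the question.

```powershell
# Assumption: running in PowerShell (Core) 6.2+; -EscapeHandling is not available in Windows PowerShell.
# $data is a hypothetical sample object for illustration only.
$data = [pscustomobject]@{ something = 'http://domain/?x=1&y=2' }

# Default escape handling: & is left verbatim.
$default = $data | ConvertTo-Json -Compress

# Opting in to HTML-style escaping reproduces the \u0026 form seen in Windows PowerShell.
$html = $data | ConvertTo-Json -Compress -EscapeHandling EscapeHtml

$default
$html
```

Both strings parse back to the same data; only the text representation differs.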

If you import the Newtonsoft JSON library, your code will look like this:

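# LoadFile requires an absolute path, so resolve the DLL's location first.
[Reflection.Assembly]::LoadFile((Resolve-Path "Newtonsoft.Json.dll").ProviderPath)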

$json = Get-Content -Raw -Path file.json -Encoding UTF8 # read file
$unescaped = [Newtonsoft.Json.Linq.JObject]::Parse($json) # similar to ConvertFrom-Json

$escapedElementValue = [Newtonsoft.Json.JsonConvert]::ToString($unescaped.apiName.Value) # similar to ConvertTo-Json
$escapedCompleteJson = [Newtonsoft.Json.JsonConvert]::SerializeObject($unescaped) # similar to ConvertTo-Json

Write-Output "Variable passed = $escapedElementValue"
Write-Output "Same JSON as Input = $escapedCompleteJson"

Answer 3

tl;dr

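The problem does not affect PowerShell (Core) 6+ (the install-on-demand, cross-platform PowerShell edition), which uses a different implementation of the ConvertTo-Json and ConvertFrom-Json cmdlets that, as of PowerShell 7.2, is based on Newtonsoft.JSON (whose direct use is shown in Answer 2). There, your sample round-trip command works as expected.

Only ConvertTo-Json in Windows PowerShell is affected (the bundled-with-Windows PowerShell edition, whose latest and final version is 5.1). But note that the JSON representation - while unexpected - is technically correct.

A simple but robust solution focuses only on unescaping those Unicode escape sequences that ConvertTo-Json unexpectedly creates - namely for & ' < > - while ruling out false positives: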

# The following sample JSON with undesired Unicode escape sequences for `& < > '` was
# created with Windows PowerShell's ConvertTo-Json as follows:
#   ConvertTo-Json "Ten o'clock at <night> & later. \u0027 \\u0027"
# Note that \\u0027 and \\\\u0027 in the resulting JSON are NOT Unicode escape
# sequences and must not be interpreted as such.
# The *desired* JSON representation - without the unexpected escaping - would be:
#   "Ten o'clock at <night> & later. \\u0027 \\\\u0027"
$json = '"Ten o\u0027clock at \u003Cnight\u003e \u0026 later. \\u0027 \\\\u0027"'

# The lookbehind only allows a match when \u is preceded by an even number
# (including zero) of backslashes, i.e. when the \ of \u is not itself escaped.
[regex]::replace(
  $json, 
  '(?<=(?:^|[^\\])(?:\\\\)*)\\u(00(?:26|27|3c|3e))', 
  { param($match) [char] [int] ('0x' + $match.Groups[1].Value) },
  'IgnoreCase'
)

The above outputs the desired JSON representation, without the unnecessary escaping:

"Ten o'clock at <night> & later. \\u0027 \\\\u0027"

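Background information:

ConvertTo-Json in Windows PowerShell unexpectedly represents the following ASCII-range characters by their Unicode escape sequences in JSON strings:

& (Unicode escape sequence: \u0026)
' (\u0027)
< and > (\u003c and \u003e)

There is no good reason to do so (these characters only require escaping in HTML/XML text).

However, any compliant JSON parser - including ConvertFrom-Json - converts these escape sequences back to the characters they represent.

In other words: while the JSON text created by Windows PowerShell's ConvertTo-Json is unexpected and can impede readability, it is technically correct and - while not identical - equivalent to the original representation in terms of the data it represents.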
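The equivalence can be checked directly. The following minimal sketch parses an escaped and an unescaped version of the same property and compares the resulting values; the two JSON strings are illustrative samples modeled on the question's input.

```powershell
# The same property, once as Windows PowerShell's ConvertTo-Json would write it
# (with \u0026) and once with a verbatim &. Both strings are illustrative samples.
$escapedJson = '{ "something": "http://domain/?x=1\u0026y=2" }'
$plainJson   = '{ "something": "http://domain/?x=1&y=2" }'

$fromEscaped = ($escapedJson | ConvertFrom-Json).something
$fromPlain   = ($plainJson   | ConvertFrom-Json).something

$fromEscaped                   # http://domain/?x=1&y=2
$fromEscaped -ceq $fromPlain   # True
```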

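Fixing the readability problem:

As an aside: while [regex]::Unescape(), whose purpose is to unescape regexes only, also converts Unicode escape sequences to the characters they represent, it is fundamentally unsuited to selectively unescaping Unicode sequences in JSON strings, because all other \ escapes must be preserved for the JSON string to remain syntactically valid.

While the solution in Answer 1 generally works well, it has limitations (aside from the easily corrected problem that a-zA-Z should be a-fA-F, to limit matching to letters that are valid hex digits):

It doesn't rule out false positives, such as \\u0027 or \\\\u0027 (\\ escapes \, so that the u0027 part becomes a verbatim string and must not be treated as an escape sequence).
It converts all Unicode escape sequences, which presents two problems:
- Escape sequences representing characters that require escaping would also be converted to the verbatim character, which would break the JSON representation. With \u005c, for instance, the character it represents, \, itself requires escaping.
- For non-BMP Unicode characters, which must be represented as pairs of Unicode escape sequences (so-called surrogate pairs), the solution would mistakenly try to unescape each half of the pair separately.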
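The false-positive problem can be reproduced with a one-value JSON document. This is only a sketch: both patterns below are illustrative (the naive one is modeled on the match-any-\uXXXX approach, the guarded one adds a lookbehind that requires an even number of preceding backslashes), and the variable names are made up.

```powershell
# JSON text for the verbatim 6-character string \u0027:
# the backslash itself is escaped as \\, so this is NOT a Unicode escape sequence.
$json = '"\\u0027"'

# Converts a regex match of \uXXXX to the character it denotes.
$toChar = { param($m) [string][char]([Convert]::ToInt32($m.Groups[1].Value, 16)) }

# Naive (illustrative) pattern: any \u followed by four hex digits.
$naive = [regex]::Replace($json, '\\u([0-9a-fA-F]{4})', $toChar)

# Guarded (illustrative) pattern: \u must be preceded by an even number of backslashes.
$guarded = [regex]::Replace($json, '(?<=(?:^|[^\\])(?:\\\\)*)\\u([0-9a-fA-F]{4})', $toChar)

$naive     # "\'"       the data changed, and \' is not a valid JSON escape
$guarded   # "\\u0027"  left untouched
```

The guarded pattern only addresses false positives; it still converts every genuine escape sequence, so the other two limitations remain.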

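For a robust solution that overcomes these limitations, see this answer (surrogate pairs are left as Unicode escape sequences, and Unicode escape sequences whose characters require escaping are converted to \-based (C-style) escapes such as \n, if possible).

However, if the only requirement is to unescape those Unicode escape sequences that Windows PowerShell's ConvertTo-Json unexpectedly creates, the solution at the top is sufficient.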
