Unable to parse JSON? Or is it JavaScript text (How to parse a custom config file)

I am trying to parse a config file, coming from a server, that I initially suspected to be JSON.

After some experimentation, I was able to navigate and fold the sections in Notepad++ when I selected JavaScript as the formatter.

However, I am stuck on how to convert/parse this data into JSON or another format, and no online tool has been able to help with this.

How can I parse this text? Ideally I am trying to do it in PowerShell, but Python is also an option if I can figure out how to get the conversion started.

For example, I am trying to parse out each server, i.e. test1, test2, test3, and get the data listed in each block.

Here is a sample of the config file format:

servername {
  store {
    servers {
      * {
        value<>
        port<>
        folder<C:\windows>
        monitor<yes>
        args<-T -H>
        xrg<store>
        wysargs<-t -g -b>
        accept_any<yes>
        pdu_length<23622>
      }
      test1 {
        name<test1>
        port<123>
        root<c:\test>
        monitor<yes>
      }
      test2 {
        name<test2>
        port<124>
        root<c:\test>
        monitor<yes>
      }
      test3 {
        name<test3>
        port<125>
        root<c:\test>
        monitor<yes>
      }
    }
    senders
    timeout<30>
  }
}
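
For reference, a plausible JSON equivalent of one server block would look like this (whether `port` stays a string or becomes a number is a matter of choice):

```json
{
  "test1": {
    "name": "test1",
    "port": 123,
    "root": "c:\\test",
    "monitor": "yes"
  }
}
```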

Here is something that converts the above config file to a dict/JSON in Python. I just did some regex substitutions, as @zett42 suggested.

import re
import json

lines = open('configfile', 'r').read()

# Quotations around the keys (next 3 substitutions)
lines2 = re.sub(r'([a-zA-Z\d_*]+)\s?{', r'"\1": {', lines)
# Process k<v> as key/value pairs
lines3 = re.sub(r'([a-zA-Z\d_*]+)\s?<([^<]*)>', r'"\1": "\2"', lines2)
# Process a single keyword on a line as a key/value pair with an empty value
lines4 = re.sub(r'^\s*([a-zA-Z\d_*]+)\s*$', r'"\1": ""', lines3, flags=re.MULTILINE)

# Replace the newline with a comma on lines ending with "
lines5 = re.sub(r'"\n', '",', lines4)

# Remove the comma before a closing bracket
lines6 = re.sub(r',\s*}', '}', lines5)

# Remove quotes from numerical values
lines7 = re.sub(r'"(\d+)"', r'\1', lines6)

# Remove whitespace (except when followed by '-', to keep values like '-T -H' intact)
lines8 = re.sub(r'[ \t\r\f]+(?!-)', '', lines7)
# Add commas after closing brackets where needed
lines9 = re.sub(r'(?<=})\n(?=")', r",\n", lines8)

# Enclose in brackets and escape backslashes for JSON parsing
lines10 = '{' + lines9.replace('\\', '\\\\') + '}'

j = json.JSONDecoder().decode(lines10)
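
As a sanity check, the core substitutions can be run end-to-end on a tiny sample (note the `\1`/`\2` group references in the replacement strings, which are essential; the key-only, whitespace, and backslash steps aren't needed for this reduced sample):

```python
import re
import json

# Minimal config snippet in the same k<v> / block format
text = "a {\n  port<123>\n  monitor<yes>\n}\n"

s = re.sub(r'([a-zA-Z\d_*]+)\s?{', r'"\1": {', text)          # quote block keys
s = re.sub(r'([a-zA-Z\d_*]+)\s?<([^<]*)>', r'"\1": "\2"', s)  # k<v> -> "k": "v"
s = re.sub(r'"\n', '",', s)                                   # comma after each value
s = re.sub(r',\s*}', '}', s)                                  # no comma before }
s = re.sub(r'"(\d+)"', r'\1', s)                              # unquote numbers
d = json.loads('{' + s + '}')
print(d)  # {'a': {'port': 123, 'monitor': 'yes'}}
```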

Edit: here is a possibly more concise alternative:

# Replace a line containing just a key with key<>
lines2 = re.sub(r'^([^{<>}]+)$', r'\1<>', lines, flags=re.MULTILINE)
# Remove whitespace not within <>
lines3 = re.sub(r'\s(?!.*?>)|\s(?![^<]+>)', '', lines2, flags=re.MULTILINE)
# Quotations
lines4 = re.sub(r'([^{<>}]+)(?={)', r'"\1":', lines3)
lines5 = re.sub(r'([^:{<>}]+)<([^{<>}]*)>', r'"\1":"\2"', lines4)
# Add commas
lines6 = re.sub(r'(?<=")"(?!")', ',"', lines5)
lines7 = re.sub(r'}(?!}|$)', '},', lines6)
# Remove quotes from numbers
lines8 = re.sub(r'"(\d+)"', r'\1', lines7)
# Escape backslashes and enclose in brackets
lines9 = '{' + re.sub(r'\\', r'\\\\', lines8) + '}'
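
The alternative can be checked the same way; this tiny sample (illustrative only) also exercises the key-only rule, e.g. a bare `senders` line (the backslash-escape step is omitted since the sample contains no backslashes):

```python
import re
import json

text = "s {\nsenders\ntimeout<30>\n}"

s = re.sub(r'^([^{<>}]+)$', r'\1<>', text, flags=re.MULTILINE)     # bare key -> key<>
s = re.sub(r'\s(?!.*?>)|\s(?![^<]+>)', '', s, flags=re.MULTILINE)  # strip whitespace outside <>
s = re.sub(r'([^{<>}]+)(?={)', r'"\1":', s)                        # quote block keys
s = re.sub(r'([^:{<>}]+)<([^{<>}]*)>', r'"\1":"\2"', s)            # k<v> -> "k":"v"
s = re.sub(r'(?<=")"(?!")', ',"', s)                               # commas between pairs
s = re.sub(r'}(?!}|$)', '},', s)                                   # commas after }
s = re.sub(r'"(\d+)"', r'\1', s)                                   # unquote numbers
d = json.loads('{' + s + '}')
print(d)  # {'s': {'senders': '', 'timeout': 30}}
```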

Edit: I have since come up with a simpler, PowerShell-only solution, which I recommend instead.

I will keep this answer, as it may still be useful for other cases. There may also be performance differences (which I haven't measured).


MYousefi has already posted a helpful answer with a Python implementation.

For PowerShell, I came up with a solution that works without a convert-to-JSON step. Instead, I adopted and generalized the RegEx-based tokenizer code from Jack Vanlightly (also see his related blog post). A tokenizer (also known as a lexer) splits and classifies the elements of the input text and outputs a flat stream of tokens (categories) and associated data. A parser can use these as input to create a structured representation of the input text.

The tokenizer is written in generic C# and can be used for any input that can be split using RegEx. The C# code is included in the PowerShell script using the Add-Type command, so no C# compiler is required.

For simplicity, the parser function ConvertFrom-ServerData is written in PowerShell. You only ever call the parser directly, so you don't need to know anything about the tokenizer's C# code. If you want to apply the code to different input, you only have to modify the PowerShell parser code.
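
The idea behind the tokenizer can be sketched in Python (a simplified, hypothetical mirror of the C# code that follows, not a drop-in replacement): each token definition is a regex plus a precedence; all matches of all definitions are collected, sorted by position, and overlaps are resolved in favor of the higher-priority (lower-numbered) definition.

```python
import re
from dataclasses import dataclass, field

@dataclass
class TokenDef:
    token: str           # token category, e.g. 'ObjectBegin'
    pattern: str         # regex; group 1 = key, group 2 = value (if any)
    precedence: int = 0  # lower number wins on overlapping matches
    regex: re.Pattern = field(init=False)
    def __post_init__(self):
        self.regex = re.compile(self.pattern, re.MULTILINE | re.IGNORECASE)

def tokenize(text, token_defs):
    # Collect all matches of all definitions, sort by start position and
    # precedence, and emit only the non-overlapping best matches.
    matches = []
    for d in token_defs:
        for m in d.regex.finditer(text):
            matches.append((m.start(), d.precedence, d.token, m))
    matches.sort(key=lambda t: (t[0], t[1]))
    last_end = -1
    for start, _, token, m in matches:
        if start < last_end:
            continue  # overlaps a previously emitted, higher-priority token
        last_end = m.end()
        yield token, m.groups()

defs = [
    TokenDef('ObjectBegin', r'^\s*([\w*]+)\s*{'),
    TokenDef('ObjectEnd',   r'^\s*}\s*$'),
    TokenDef('ValueInt',    r'^\s*(\w+)\s*<([+-]?\d+)>\s*$', 0),
    TokenDef('ValueString', r'^\s*(\w+)\s*<(.*)>\s*$', 1),  # less specific -> lower priority
]
tokens = list(tokenize('a {\n  port<123>\n  name<x>\n}', defs))
# [('ObjectBegin', ('a',)), ('ValueInt', ('port', '123')),
#  ('ValueString', ('name', 'x')), ('ObjectEnd', ())]
```

Note how `port<123>` is claimed by the more specific ValueInt definition, while ValueString only gets the lines that no higher-priority definition matched.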

Save the following file in the same directory as the PowerShell script:

"RegExTokenizer.cs":

// Generic, precedence-based RegEx tokenizer.
// This code is based on https://github.com/Vanlightly/DslParser 
// from Jack Vanlightly (https://jack-vanlightly.com).
// Modifications:
// - Interface improved for ease-of-use from PowerShell.
// - Return all groups from the RegEx match instead of just the value. This simplifies parsing of key/value pairs by requiring only a single token definition.
// - Some code simplifications, e. g. replacing "for" loops by "foreach".

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Text.RegularExpressions;

namespace DslTokenizer {
    public class DslToken<TokenType> {
        public TokenType Token { get; set; }
        public GroupCollection Groups { get; set; }
    }

    public class TokenMatch<TokenType> {
        public TokenType Token { get; set; }
        public GroupCollection Groups { get; set; }
        public int StartIndex { get; set; }
        public int EndIndex { get; set; }
        public int Precedence { get; set; }
    }

    public class TokenDefinition<TokenType> {
        private Regex _regex;
        private readonly TokenType _returnsToken;
        private readonly int _precedence;

        public TokenDefinition( TokenType returnsToken, string regexPattern, int precedence ) {
            _regex = new Regex( regexPattern, RegexOptions.Multiline | RegexOptions.IgnoreCase | RegexOptions.Compiled );
            _returnsToken = returnsToken;
            _precedence = precedence;
        }

        public IEnumerable<TokenMatch<TokenType>> FindMatches( string inputString ) {

            foreach( Match match in _regex.Matches( inputString ) ) {
                yield return new TokenMatch<TokenType>() {
                    StartIndex = match.Index,
                    EndIndex   = match.Index + match.Length,
                    Token      = _returnsToken,
                    Groups     = match.Groups,
                    Precedence = _precedence
                };
            }
        }
    }

    public class PrecedenceBasedRegexTokenizer<TokenType> {

        private List<TokenDefinition<TokenType>> _tokenDefinitions = new List<TokenDefinition<TokenType>>();

        public PrecedenceBasedRegexTokenizer() {}

        public PrecedenceBasedRegexTokenizer( IEnumerable<TokenDefinition<TokenType>> tokenDefinitions ) {
            _tokenDefinitions = tokenDefinitions.ToList();
        }

        // Easy-to-use interface as alternative to constructor that takes an IEnumerable.
        public void AddTokenDef( TokenType returnsToken, string regexPattern, int precedence = 0 ) {
            _tokenDefinitions.Add( new TokenDefinition<TokenType>( returnsToken, regexPattern, precedence ) );
        }

        public IEnumerable<DslToken<TokenType>> Tokenize( string lqlText ) {

            var tokenMatches = FindTokenMatches( lqlText );

            var groupedByIndex = tokenMatches.GroupBy( x => x.StartIndex )
                .OrderBy( x => x.Key )
                .ToList();

            TokenMatch<TokenType> lastMatch = null;

            foreach( var match in groupedByIndex ) {

                var bestMatch = match.OrderBy( x => x.Precedence ).First();
                if( lastMatch != null && bestMatch.StartIndex < lastMatch.EndIndex ) {
                    continue;
                }

                yield return new DslToken<TokenType>(){ Token = bestMatch.Token, Groups = bestMatch.Groups };

                lastMatch = bestMatch;
            }
        }

        private List<TokenMatch<TokenType>> FindTokenMatches( string lqlText ) {

            var tokenMatches = new List<TokenMatch<TokenType>>();

            foreach( var tokenDefinition in _tokenDefinitions ) {
                tokenMatches.AddRange( tokenDefinition.FindMatches( lqlText ).ToList() );
            }
            return tokenMatches;
        }
    }        
}

The parser function, written in PowerShell:

$ErrorActionPreference = 'Stop'

Add-Type -TypeDefinition (Get-Content $PSScriptRoot\RegExTokenizer.cs -Raw)

Function ConvertFrom-ServerData {
    [CmdletBinding()]
    param (
        [Parameter(Mandatory, ValueFromPipeline)] [string] $InputObject
    )

    begin {
        # Define the kind of possible tokens.
        enum ServerDataTokens {
            ObjectBegin
            ObjectEnd
            ValueInt
            ValueBool
            ValueString
            KeyOnly
        }
        
        # Create an instance of the tokenizer from "RegExTokenizer.cs".
        $tokenizer = [DslTokenizer.PrecedenceBasedRegexTokenizer[ServerDataTokens]]::new()

        # Define a RegEx for each token where 1st group matches key and 2nd matches value (if any).
        # To resolve ambiguities, most specific RegEx must come first 
        # (e. g. ValueInt line must come before ValueString line).
        # Alternatively pass a 3rd integer parameter that defines the precedence.        
        $tokenizer.AddTokenDef( [ServerDataTokens]::ObjectBegin, '^\s*([\w*]+)\s*{' )
        $tokenizer.AddTokenDef( [ServerDataTokens]::ObjectEnd,   '^\s*}\s*$' )
        $tokenizer.AddTokenDef( [ServerDataTokens]::ValueInt,    '^\s*(\w+)\s*<([+-]?\d+)>\s*$' )
        $tokenizer.AddTokenDef( [ServerDataTokens]::ValueBool,   '^\s*(\w+)\s*<(yes|no)>\s*$' )
        $tokenizer.AddTokenDef( [ServerDataTokens]::ValueString, '^\s*(\w+)\s*<(.*)>\s*$' )
        $tokenizer.AddTokenDef( [ServerDataTokens]::KeyOnly,     '^\s*(\w+)\s*$' )
    }

    process {
        # Output is an ordered hashtable
        $outputObject = [ordered] @{}

        $curObject = $outputObject

        # A stack is used to keep track of nested objects.
        $stack = [Collections.Stack]::new()
        
        # For each token produced by the tokenizer
        $tokenizer.Tokenize( $InputObject ).ForEach{
        
            # $_.Groups[0] is the full match, which we discard by assigning to $null 
            $null, $key, $value = $_.Groups.Value
            
            switch( $_.Token ) {
                ([ServerDataTokens]::ObjectBegin) {  
                    $child = [ordered] @{} 
                    $curObject[ $key ] = $child
                    $stack.Push( $curObject )
                    $curObject = $child
                    break
                }
                ([ServerDataTokens]::ObjectEnd) {
                    $curObject = $stack.Pop()
                    break
                }
                ([ServerDataTokens]::ValueInt) {
                    $intValue = 0
                    $curObject[ $key ] = if( [int]::TryParse( $value, [ref] $intValue ) ) { $intValue } else { $value }
                    break
                }
                ([ServerDataTokens]::ValueBool) {
                    $curObject[ $key ] = $value -eq 'yes'
                    break
                }
                ([ServerDataTokens]::ValueString) {
                    $curObject[ $key ] = $value
                    break
                }
                ([ServerDataTokens]::KeyOnly) {
                    $curObject[ $key ] = $null
                    break
                }
            }
        }

        $outputObject  # Implicit output
    }
}

Usage example:

$sampleData = @'
servername {
    store {
      servers {
        * {
          value<>
          port<>
          folder<C:\windows>
          monitor<yes>
          args<-T -H>
          xrg<store>
          wysargs<-t -g -b>
          accept_any<yes>
          pdu_length<23622>
        }
        test1 {
          name<test1>
          port<123>
          root<c:\test>
          monitor<yes>
        }
        test2 {
          name<test2>
          port<124>
          root<c:\test>
          monitor<yes>
        }
        test3 {
          name<test3>
          port<125>
          root<c:\test>
          monitor<yes>
        }
      }
      senders
      timeout<30>
    }
  }
'@

# Call the parser
$objects = $sampleData | ConvertFrom-ServerData

# The parser outputs nested hashtables, so we have to use GetEnumerator() to
# iterate over the key/value pairs.

$objects.servername.store.servers.GetEnumerator().ForEach{
    "[ SERVER: $($_.Key) ]"
    # Convert server values hashtable to PSCustomObject for better output formatting
    [PSCustomObject] $_.Value | Format-List
}

Output:

[ SERVER: * ]

value      : 
port       : 
folder     : C:\windows
monitor    : True      
args       : -T -H     
xrg        : store     
wysargs    : -t -g -b  
accept_any : True      
pdu_length : 23622     


[ SERVER: test1 ]      

name    : test1        
port    : 123
root    : c:\test      
monitor : True


[ SERVER: test2 ]      

name    : test2        
port    : 124
root    : c:\test      
monitor : True


[ SERVER: test3 ]

name    : test3
port    : 125
root    : c:\test
monitor : True

Notes:

  • If you pass input from Get-Content to the parser, make sure to use the -Raw parameter, e.g. $objects = Get-Content input.cfg -Raw | ConvertFrom-ServerData. Otherwise the parser would try to parse each input line on its own.
  • I have chosen to convert "yes"/"no" values to bool, so they are output as "True"/"False". Remove the line $tokenizer.AddTokenDef( [ServerDataTokens]::ValueBool, ... ) to parse them as string and output them as-is.
  • Keys without a value <> ("senders" in the sample) are stored as keys with a value of $null.
  • The RegEx patterns enforce that values can only be single-line (as shown in the sample data). This allows us to embed > characters without having to escape them.

I came up with a simpler solution than my tokenizer-based one above, which uses only PowerShell code.

Using the RegEx alternation operator |, we combine all token patterns into a single pattern and use named subexpressions to determine which alternative actually matched.

The rest of the code is structurally similar to the C#/PS version.

using namespace System.Text.RegularExpressions

$ErrorActionPreference = 'Stop'

Function ConvertFrom-ServerData {
    [CmdletBinding()]
    param (
        [Parameter(Mandatory, ValueFromPipeline)] [string] $InputObject
    )

    begin {
        # Key can consist of anything except whitespace and < > { }
        $keyPattern = '[^\s<>{}]+'

        # Order of the patterns is important 
        $pattern = (
            "(?<IntKey>$keyPattern)\s*<(?<IntValue>\d+)>",
            "(?<TrueKey>$keyPattern)\s*<yes>",
            "(?<FalseKey>$keyPattern)\s*<no>",
            "(?<StrKey>$keyPattern)\s*<(?<StrValue>.*?)>",
            "(?<ObjectBegin>$keyPattern)\s*{",
            "(?<ObjectEnd>})",
            "(?<KeyOnly>$keyPattern)",
            "(?<Invalid>\S+)"  # any non-whitespace sequence that didn't match the valid patterns
        ) -join '|'
    }

    process {
        # Output is an ordered hashtable
        $curObject = $outputObject = [ordered] @{}

        # A stack is used to keep track of nested objects.
        $stack = [Collections.Stack]::new()
        
        # For each pattern match
        foreach( $match in [RegEx]::Matches( $InputObject, $pattern, [RegexOptions]::Multiline ) ) {

            # Get the RegEx groups that have actually matched.
            $matchGroups = $match.Groups.Where{ $_.Success -and $_.Name.Length -gt 1 }

            $key = $matchGroups[ 0 ].Value

            switch( $matchGroups[ 0 ].Name ) {
                'ObjectBegin' {
                    $child = [ordered] @{} 
                    $curObject[ $key ] = $child
                    $stack.Push( $curObject )
                    $curObject = $child
                    break                    
                }
                'ObjectEnd' {
                    if( $stack.Count -eq 0 ) {
                        Write-Error -EA Stop "Parse error: Curly braces are unbalanced. There are more '}' than '{' in config data."
                    }
                    $curObject = $stack.Pop()
                    break
                }
                'IntKey' {
                    $value = $matchGroups[ 1 ].Value 
                    $intValue = 0
                    $curObject[ $key ] = if( [int]::TryParse( $value, [ref] $intValue ) ) { $intValue } else { $value }
                    break
                }
                'TrueKey' {
                    $curObject[ $key ] = $true
                    break
                }
                'FalseKey' {
                    $curObject[ $key ] = $false
                    break
                }
                'StrKey' {
                    $value = $matchGroups[ 1 ].Value
                    $curObject[ $key ] = $value
                    break
                }
                'KeyOnly' {
                    $curObject[ $key ] = $null
                    break
                }
                'Invalid' {
                    Write-Warning "Invalid token at index $($match.Index): $key"
                    break
                }
            }
        }

        if( $stack.Count -gt 0 ) {
            Write-Error "Parse error: Curly braces are unbalanced. There are more '{' than '}' in config data."
        }

        $outputObject  # Implicit output
    }
}

Usage example:


$sampleData = @'
test-server {
    store {
      servers {
        * {
          value<>
          port<>
          folder<C:\windows> monitor<yes>
          args<-T -H>
          xrg<store>
          wysargs<-t -g -b>
          accept_any<yes>
          pdu_length<23622>
        }
        test1 {
          name<test1>
          port<123>
          root<c:\test>
          monitor<yes>
        }
        test2 {
          name<test2>
          port<124>
          root<c:\test>
          monitor<yes>
        }
        test3 {
          name<test3>
          port<125>
          root<c:\test>
          monitor<yes>
        }
      }
      senders
      timeout<30>
    }
  }
'@

# Call the parser
$objects = $sampleData | ConvertFrom-ServerData

# Uncomment to verify the whole result
#$objects | ConvertTo-Json -Depth 10

# The parser outputs nested hashtables, so we have to use GetEnumerator() to
# iterate over the key/value pairs.
$objects.'test-server'.store.servers.GetEnumerator().ForEach{
    "[ SERVER: $($_.Key) ]"
    # Convert server values hashtable to PSCustomObject for better output formatting
    [PSCustomObject] $_.Value | Format-List
}

Output:

[ SERVER: * ]

value      : 
port       : 
folder     : C:\windows
monitor    : True
args       : -T -H
xrg        : store
wysargs    : -t -g -b
accept_any : True
pdu_length : 23622


[ SERVER: test1 ]

name    : test1
port    : 123
root    : c:\test
monitor : True


[ SERVER: test2 ]

name    : test2
port    : 124
root    : c:\test
monitor : True


[ SERVER: test3 ]

name    : test3
port    : 125
root    : c:\test
monitor : True

Notes:

  • I have relaxed the RegEx even more. Keys can now consist of any characters except whitespace and <>{}.
  • Line breaks are no longer required. This is more flexible, but you can't have strings with embedded > characters. Let me know if this is a problem.
  • I have added detection of invalid tokens, which are reported as warnings. Remove the "(?<Invalid>\S+)" line if you want to ignore invalid tokens instead.
  • Unbalanced curly braces are detected and reported as errors.
  • You can view the RegEx at RegEx101 to see how it works and get an explanation.
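
For comparison, the same single-pattern, named-group approach translates almost directly to Python. This is a hypothetical sketch of the technique, not part of the original answer; it omits the Invalid token and the unbalanced-brace checks:

```python
import re

KEY = r'[^\s<>{}]+'  # anything except whitespace and < > { }

# Order of the alternatives is important: most specific first.
PATTERN = re.compile('|'.join([
    rf'(?P<IntKey>{KEY})\s*<(?P<IntValue>\d+)>',
    rf'(?P<TrueKey>{KEY})\s*<yes>',
    rf'(?P<FalseKey>{KEY})\s*<no>',
    rf'(?P<StrKey>{KEY})\s*<(?P<StrValue>.*?)>',
    rf'(?P<ObjectBegin>{KEY})\s*{{',
    r'(?P<ObjectEnd>})',
    rf'(?P<KeyOnly>{KEY})',
]))

def parse(text):
    # Python dicts preserve insertion order, like [ordered]@{} in PowerShell.
    root = cur = {}
    stack = []
    for m in PATTERN.finditer(text):
        # The first successful named group tells us which alternative matched.
        kind = next(n for n, v in m.groupdict().items() if v is not None)
        if kind == 'ObjectBegin':
            child = {}
            cur[m['ObjectBegin']] = child
            stack.append(cur)
            cur = child
        elif kind == 'ObjectEnd':
            cur = stack.pop()
        elif kind == 'IntKey':
            cur[m['IntKey']] = int(m['IntValue'])
        elif kind == 'TrueKey':
            cur[m['TrueKey']] = True
        elif kind == 'FalseKey':
            cur[m['FalseKey']] = False
        elif kind == 'StrKey':
            cur[m['StrKey']] = m['StrValue']
        else:  # KeyOnly
            cur[m['KeyOnly']] = None
    return root

cfg = 'a {\n  port<123>\n  monitor<yes>\n  senders\n}'
print(parse(cfg))  # {'a': {'port': 123, 'monitor': True, 'senders': None}}
```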

C# version:

  • I have created a much faster C# version of the parser.
  • It requires PowerShell 7+. It can be imported using Add-Type -Path FileName.cs and called like [zett42.ServerDataParser]::Parse($text).
  • It doesn't use RegEx. It is based on ReadOnlySpan<char> and uses only simple string operations. In my benchmarks it is about 10x as fast as a C# version that uses RegEx, and about 60x as fast as the PS-only version.