Unable to parse JSON? Or is it JavaScript text? (How to parse a custom config file)
I'm trying to parse a config file that comes from a server, which I initially suspected was JSON.
After some experimenting, I found that I can navigate and fold the sections in Notepad++ when I select JavaScript as the language/formatter.
However, I'm stuck on how to convert/parse this data into JSON or another format, and none of the online tools I tried could handle it.
How can I parse this text? Ideally I'd like to use PowerShell, but Python is also an option if I can figure out how to get the conversion started.
For example, I'm trying to parse out each server, i.e. test1, test2, test3, and get the data listed in each block.
Here is a sample of the config file format:
servername {
    store {
        servers {
            * {
                value<>
                port<>
                folder<C:\windows>
                monitor<yes>
                args<-T -H>
                xrg<store>
                wysargs<-t -g -b>
                accept_any<yes>
                pdu_length<23622>
            }
            test1 {
                name<test1>
                port<123>
                root<c:\test>
                monitor<yes>
            }
            test2 {
                name<test2>
                port<124>
                root<c:\test>
                monitor<yes>
            }
            test3 {
                name<test3>
                port<125>
                root<c:\test>
                monitor<yes>
            }
        }
        senders
        timeout<30>
    }
}
Here is something that converts the above config file to a dict/JSON in Python. I just did a series of regex substitutions, following @zett42's suggestion.
import re
import json

lines = open('configfile', 'r').read()

# Quotations around the keys (next 3 lines)
lines2 = re.sub(r'([a-zA-Z\d_*]+)\s?{', r'"\1": {', lines)
# Process k<v> as Key, Value pairs
lines3 = re.sub(r'([a-zA-Z\d_*]+)\s?<([^<]*)>', r'"\1": "\2"', lines2)
# Process single key word on the line as Key, Value pair with empty value
lines4 = re.sub(r'^\s*([a-zA-Z\d_*]+)\s*$', r'"\1": ""', lines3, flags=re.MULTILINE)
# Replace \n with commas in lines ending with a quote
lines5 = re.sub(r'"\n', '",', lines4)
# Remove the comma before the closing bracket
lines6 = re.sub(r',\s*}', '}', lines5)
# Remove quotes from numerical values
lines7 = re.sub(r'"(\d+)"', r'\1', lines6)
# Remove horizontal whitespace, except before "-" (keeps values like "-T -H" intact)
lines8 = re.sub(r'[ \t\r\f]+(?!-)', '', lines7)
# Add commas after closing brackets when needed
lines9 = re.sub(r'(?<=})\n(?=")', r",\n", lines8)
# Enclose in brackets and escape backslash for json parsing
lines10 = '{' + lines9.replace('\\', '\\\\') + '}'
j = json.JSONDecoder().decode(lines10)
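Once decoded, j is just a nested dict, so the per-server data can be accessed directly. A minimal sketch, assuming the conversion above succeeded on the sample config:

# Iterate the parsed servers and print their key/value pairs
for name, server in j["servername"]["store"]["servers"].items():
    print(f"[ SERVER: {name} ]")
    for key, value in server.items():
        print(f"  {key} = {value}")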
Edit:
Here is a possibly more concise alternative:
# Replace a line containing just a key with key<>
lines2 = re.sub(r'^([^{<>}]+)$', r'\1<>', lines, flags=re.MULTILINE)
# Remove spaces not within <>
lines3 = re.sub(r'\s(?!.*?>)|\s(?![^<]+>)', '', lines2, flags=re.MULTILINE)
# Quotations
lines4 = re.sub(r'([^{<>}]+)(?={)', r'"\1":', lines3)
lines5 = re.sub(r'([^:{<>}]+)<([^{<>}]*)>', r'"\1":"\2"', lines4)
# Add commas
lines6 = re.sub(r'(?<=")"(?!")', ',"', lines5)
lines7 = re.sub(r'}(?!}|$)', '},', lines6)
# Remove quotes from numbers
lines8 = re.sub(r'"(\d+)"', r'\1', lines7)
# Escape \
lines9 = '{' + re.sub(r'\\', r'\\\\', lines8) + '}'
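As with the first variant, the result still has to be decoded. A quick way to check the conversion (reusing the json import from the first snippet, and assuming the substitutions above produced valid JSON for your input):

# Decode and pretty-print the parsed structure for inspection
config = json.loads(lines9)
print(json.dumps(config, indent=2))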
Edit: I have since come up with a simpler, PowerShell-only solution (see further below), which I recommend using.
I'm keeping this answer because it may still be useful for other cases. There may also be differences in performance (I haven't measured).
MYousefi has already posted a helpful answer with a Python implementation.
For PowerShell, I came up with a solution that works without a convert-to-JSON step. Instead, I adopted and generalized the RegEx-based tokenizer code from Jack Vanlightly (also see the related blog post). A tokenizer (aka lexer) splits and classifies the elements of the input text and outputs a flat stream of tokens (categories) and associated data. A parser can then use these as input to create a structured representation of the input text.
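To illustrate the tokenizer-then-parser idea independently of the actual implementation, here is a toy Python sketch; the token names and patterns are made up for illustration only, while the real token definitions live in the PowerShell parser further below:

# Illustrative only: a line-based tokenizer that emits a flat stream of (token, groups) pairs.
import re

TOKEN_DEFS = [  # most specific pattern first, mirroring the precedence idea used below
    ("ObjectBegin", re.compile(r'^\s*([\w*]+)\s*{')),
    ("ObjectEnd",   re.compile(r'^\s*}\s*$')),
    ("KeyValue",    re.compile(r'^\s*(\w+)\s*<(.*)>\s*$')),
    ("KeyOnly",     re.compile(r'^\s*(\w+)\s*$')),
]

def tokenize(text):
    """Yield (token, groups) for each input line; a parser turns this stream into nested objects."""
    for line in text.splitlines():
        for token, regex in TOKEN_DEFS:
            m = regex.match(line)
            if m:
                yield token, m.groups()
                break

sample = "test1 {\n  port<123>\n}"
print(list(tokenize(sample)))
# [('ObjectBegin', ('test1',)), ('KeyValue', ('port', '123')), ('ObjectEnd', ())]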
The tokenizer is written in generic C# and can be used for any input that can be split using RegEx. The C# code is included in PowerShell using the Add-Type command, so no C# compiler is required.
For simplicity, the parser function ConvertFrom-ServerData is written in PowerShell. You only ever call the parser directly, so you don't need to know anything about the tokenizer C# code. If you want to apply the code to different input, you only have to modify the PowerShell parser code.
Save the following file in the same directory as your PowerShell script:
"RegExTokenizer.cs":
// Generic, precedence-based RegEx tokenizer.
// This code is based on https://github.com/Vanlightly/DslParser
// from Jack Vanlightly (https://jack-vanlightly.com).
// Modifications:
// - Interface improved for ease-of-use from PowerShell.
// - Return all groups from the RegEx match instead of just the value. This simplifies parsing of key/value pairs by requiring only a single token definition.
// - Some code simplifications, e. g. replacing "for" loops by "foreach".

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Text.RegularExpressions;

namespace DslTokenizer {

    public class DslToken<TokenType> {
        public TokenType Token { get; set; }
        public GroupCollection Groups { get; set; }
    }

    public class TokenMatch<TokenType> {
        public TokenType Token { get; set; }
        public GroupCollection Groups { get; set; }
        public int StartIndex { get; set; }
        public int EndIndex { get; set; }
        public int Precedence { get; set; }
    }

    public class TokenDefinition<TokenType> {
        private Regex _regex;
        private readonly TokenType _returnsToken;
        private readonly int _precedence;

        public TokenDefinition( TokenType returnsToken, string regexPattern, int precedence ) {
            _regex = new Regex( regexPattern, RegexOptions.Multiline | RegexOptions.IgnoreCase | RegexOptions.Compiled );
            _returnsToken = returnsToken;
            _precedence = precedence;
        }

        public IEnumerable<TokenMatch<TokenType>> FindMatches( string inputString ) {
            foreach( Match match in _regex.Matches( inputString ) ) {
                yield return new TokenMatch<TokenType>() {
                    StartIndex = match.Index,
                    EndIndex = match.Index + match.Length,
                    Token = _returnsToken,
                    Groups = match.Groups,
                    Precedence = _precedence
                };
            }
        }
    }

    public class PrecedenceBasedRegexTokenizer<TokenType> {
        private List<TokenDefinition<TokenType>> _tokenDefinitions = new List<TokenDefinition<TokenType>>();

        public PrecedenceBasedRegexTokenizer() {}

        public PrecedenceBasedRegexTokenizer( IEnumerable<TokenDefinition<TokenType>> tokenDefinitions ) {
            _tokenDefinitions = tokenDefinitions.ToList();
        }

        // Easy-to-use interface as alternative to constructor that takes an IEnumerable.
        public void AddTokenDef( TokenType returnsToken, string regexPattern, int precedence = 0 ) {
            _tokenDefinitions.Add( new TokenDefinition<TokenType>( returnsToken, regexPattern, precedence ) );
        }

        public IEnumerable<DslToken<TokenType>> Tokenize( string lqlText ) {
            var tokenMatches = FindTokenMatches( lqlText );

            var groupedByIndex = tokenMatches.GroupBy( x => x.StartIndex )
                .OrderBy( x => x.Key )
                .ToList();

            TokenMatch<TokenType> lastMatch = null;
            foreach( var match in groupedByIndex ) {
                var bestMatch = match.OrderBy( x => x.Precedence ).First();
                if( lastMatch != null && bestMatch.StartIndex < lastMatch.EndIndex ) {
                    continue;
                }

                yield return new DslToken<TokenType>(){ Token = bestMatch.Token, Groups = bestMatch.Groups };

                lastMatch = bestMatch;
            }
        }

        private List<TokenMatch<TokenType>> FindTokenMatches( string lqlText ) {
            var tokenMatches = new List<TokenMatch<TokenType>>();

            foreach( var tokenDefinition in _tokenDefinitions ) {
                tokenMatches.AddRange( tokenDefinition.FindMatches( lqlText ).ToList() );
            }

            return tokenMatches;
        }
    }
}
The parser function, written in PowerShell:
$ErrorActionPreference = 'Stop'

Add-Type -TypeDefinition (Get-Content $PSScriptRoot\RegExTokenizer.cs -Raw)

Function ConvertFrom-ServerData {
    [CmdletBinding()]
    param (
        [Parameter(Mandatory, ValueFromPipeline)] [string] $InputObject
    )

    begin {
        # Define the kind of possible tokens.
        enum ServerDataTokens {
            ObjectBegin
            ObjectEnd
            ValueInt
            ValueBool
            ValueString
            KeyOnly
        }

        # Create an instance of the tokenizer from "RegExTokenizer.cs".
        $tokenizer = [DslTokenizer.PrecedenceBasedRegexTokenizer[ServerDataTokens]]::new()

        # Define a RegEx for each token where 1st group matches key and 2nd matches value (if any).
        # To resolve ambiguities, most specific RegEx must come first
        # (e. g. ValueInt line must come before ValueString line).
        # Alternatively pass a 3rd integer parameter that defines the precedence.
        $tokenizer.AddTokenDef( [ServerDataTokens]::ObjectBegin, '^\s*([\w*]+)\s*{' )
        $tokenizer.AddTokenDef( [ServerDataTokens]::ObjectEnd, '^\s*}\s*$' )
        $tokenizer.AddTokenDef( [ServerDataTokens]::ValueInt, '^\s*(\w+)\s*<([+-]?\d+)>\s*$' )
        $tokenizer.AddTokenDef( [ServerDataTokens]::ValueBool, '^\s*(\w+)\s*<(yes|no)>\s*$' )
        $tokenizer.AddTokenDef( [ServerDataTokens]::ValueString, '^\s*(\w+)\s*<(.*)>\s*$' )
        $tokenizer.AddTokenDef( [ServerDataTokens]::KeyOnly, '^\s*(\w+)\s*$' )
    }

    process {
        # Output is an ordered hashtable
        $outputObject = [ordered] @{}
        $curObject = $outputObject

        # A stack is used to keep track of nested objects.
        $stack = [Collections.Stack]::new()

        # For each token produced by the tokenizer
        $tokenizer.Tokenize( $InputObject ).ForEach{

            # $_.Groups[0] is the full match, which we discard by assigning to $null
            $null, $key, $value = $_.Groups.Value

            switch( $_.Token ) {
                ([ServerDataTokens]::ObjectBegin) {
                    $child = [ordered] @{}
                    $curObject[ $key ] = $child
                    $stack.Push( $curObject )
                    $curObject = $child
                    break
                }
                ([ServerDataTokens]::ObjectEnd) {
                    $curObject = $stack.Pop()
                    break
                }
                ([ServerDataTokens]::ValueInt) {
                    $intValue = 0
                    $curObject[ $key ] = if( [int]::TryParse( $value, [ref] $intValue ) ) { $intValue } else { $value }
                    break
                }
                ([ServerDataTokens]::ValueBool) {
                    $curObject[ $key ] = $value -eq 'yes'
                    break
                }
                ([ServerDataTokens]::ValueString) {
                    $curObject[ $key ] = $value
                    break
                }
                ([ServerDataTokens]::KeyOnly) {
                    $curObject[ $key ] = $null
                    break
                }
            }
        }

        $outputObject  # Implicit output
    }
}
Usage example:
$sampleData = @'
servername {
    store {
        servers {
            * {
                value<>
                port<>
                folder<C:\windows>
                monitor<yes>
                args<-T -H>
                xrg<store>
                wysargs<-t -g -b>
                accept_any<yes>
                pdu_length<23622>
            }
            test1 {
                name<test1>
                port<123>
                root<c:\test>
                monitor<yes>
            }
            test2 {
                name<test2>
                port<124>
                root<c:\test>
                monitor<yes>
            }
            test3 {
                name<test3>
                port<125>
                root<c:\test>
                monitor<yes>
            }
        }
        senders
        timeout<30>
    }
}
'@

# Call the parser
$objects = $sampleData | ConvertFrom-ServerData

# The parser outputs nested hashtables, so we have to use GetEnumerator() to
# iterate over the key/value pairs.
$objects.servername.store.servers.GetEnumerator().ForEach{
    "[ SERVER: $($_.Key) ]"
    # Convert server values hashtable to PSCustomObject for better output formatting
    [PSCustomObject] $_.Value | Format-List
}
Output:
[ SERVER: * ]
value :
port :
folder : C:\windows
monitor : True
args : -T -H
xrg : store
wysargs : -t -g -b
accept_any : True
pdu_length : 23622
[ SERVER: test1 ]
name : test1
port : 123
root : c:\test
monitor : True
[ SERVER: test2 ]
name : test2
port : 124
root : c:\test
monitor : True
[ SERVER: test3 ]
name : test3
port : 125
root : c:\test
monitor : True
Notes:
- If you pass input from Get-Content to the parser, make sure to use the -Raw parameter, e.g. $objects = Get-Content input.cfg -Raw | ConvertFrom-ServerData. Otherwise the parser will try to parse each input line on its own.
- I chose to convert "yes"/"no" values to bool, so they are output as "True"/"False". Remove the $tokenizer.AddTokenDef( [ServerDataTokens]::ValueBool, ... ) line to parse them as string and output them as-is.
- Keys without a value <> ("senders" in the sample) are stored as keys with a value of $null.
- The RegEx patterns enforce that values are single-line only (as shown in the sample data). This allows us to embed > characters without having to escape them.
I have come up with a simpler solution than my first one above, which uses only PowerShell code.
Using the RegEx alternation operator | we combine all token patterns into a single pattern and use named subexpressions to determine which one actually matched.
The rest of the code is structurally similar to the C#/PS version.
using namespace System.Text.RegularExpressions

$ErrorActionPreference = 'Stop'

Function ConvertFrom-ServerData {
    [CmdletBinding()]
    param (
        [Parameter(Mandatory, ValueFromPipeline)] [string] $InputObject
    )

    begin {
        # Key can consist of anything except whitespace and < > { }
        $keyPattern = '[^\s<>{}]+'

        # Order of the patterns is important
        $pattern = (
            "(?<IntKey>$keyPattern)\s*<(?<IntValue>\d+)>",
            "(?<TrueKey>$keyPattern)\s*<yes>",
            "(?<FalseKey>$keyPattern)\s*<no>",
            "(?<StrKey>$keyPattern)\s*<(?<StrValue>.*?)>",
            "(?<ObjectBegin>$keyPattern)\s*{",
            "(?<ObjectEnd>})",
            "(?<KeyOnly>$keyPattern)",
            "(?<Invalid>\S+)"  # any non-whitespace sequence that didn't match the valid patterns
        ) -join '|'
    }

    process {
        # Output is an ordered hashtable
        $curObject = $outputObject = [ordered] @{}

        # A stack is used to keep track of nested objects.
        $stack = [Collections.Stack]::new()

        # For each pattern match
        foreach( $match in [RegEx]::Matches( $InputObject, $pattern, [RegexOptions]::Multiline ) ) {

            # Get the RegEx groups that have actually matched.
            $matchGroups = $match.Groups.Where{ $_.Success -and $_.Name.Length -gt 1 }

            $key = $matchGroups[ 0 ].Value

            switch( $matchGroups[ 0 ].Name ) {
                'ObjectBegin' {
                    $child = [ordered] @{}
                    $curObject[ $key ] = $child
                    $stack.Push( $curObject )
                    $curObject = $child
                    break
                }
                'ObjectEnd' {
                    if( $stack.Count -eq 0 ) {
                        Write-Error -EA Stop "Parse error: Curly braces are unbalanced. There are more '}' than '{' in config data."
                    }
                    $curObject = $stack.Pop()
                    break
                }
                'IntKey' {
                    $value = $matchGroups[ 1 ].Value
                    $intValue = 0
                    $curObject[ $key ] = if( [int]::TryParse( $value, [ref] $intValue ) ) { $intValue } else { $value }
                    break
                }
                'TrueKey' {
                    $curObject[ $key ] = $true
                    break
                }
                'FalseKey' {
                    $curObject[ $key ] = $false
                    break
                }
                'StrKey' {
                    $value = $matchGroups[ 1 ].Value
                    $curObject[ $key ] = $value
                    break
                }
                'KeyOnly' {
                    $curObject[ $key ] = $null
                    break
                }
                'Invalid' {
                    Write-Warning "Invalid token at index $($match.Index): $key"
                    break
                }
            }
        }

        if( $stack.Count -gt 0 ) {
            Write-Error "Parse error: Curly braces are unbalanced. There are more '{' than '}' in config data."
        }

        $outputObject  # Implicit output
    }
}
Usage example:
$sampleData = @'
test-server {
    store {
        servers {
            * {
                value<>
                port<>
                folder<C:\windows> monitor<yes>
                args<-T -H>
                xrg<store>
                wysargs<-t -g -b>
                accept_any<yes>
                pdu_length<23622>
            }
            test1 {
                name<test1>
                port<123>
                root<c:\test>
                monitor<yes>
            }
            test2 {
                name<test2>
                port<124>
                root<c:\test>
                monitor<yes>
            }
            test3 {
                name<test3>
                port<125>
                root<c:\test>
                monitor<yes>
            }
        }
        senders
        timeout<30>
    }
}
'@

# Call the parser
$objects = $sampleData | ConvertFrom-ServerData

# Uncomment to verify the whole result
#$objects | ConvertTo-Json -Depth 10

# The parser outputs nested hashtables, so we have to use GetEnumerator() to
# iterate over the key/value pairs.
$objects.'test-server'.store.servers.GetEnumerator().ForEach{
    "[ SERVER: $($_.Key) ]"
    # Convert server values hashtable to PSCustomObject for better output formatting
    [PSCustomObject] $_.Value | Format-List
}
Output:
[ SERVER: * ]
value :
port :
folder : C:\windows
monitor : True
args : -T -H
xrg : store
wysargs : -t -g -b
accept_any : True
pdu_length : 23622
[ SERVER: test1 ]
name : test1
port : 123
root : c:\test
monitor : True
[ SERVER: test2 ]
name : test2
port : 124
root : c:\test
monitor : True
[ SERVER: test3 ]
name : test3
port : 125
root : c:\test
monitor : True
Notes:
- I have relaxed the RegEx patterns further. A key may now consist of any characters except whitespace, <, >, { and }.
- Line breaks are no longer required. This is more flexible, but you cannot have strings with embedded > characters. Let me know if this is a problem.
- I have added detection of invalid tokens, which are reported as warnings. Remove the "(?<Invalid>\S+)" line if you want to ignore invalid tokens.
- Unbalanced curly braces are detected and reported as errors.
- You can view the RegEx at RegEx101 to see how it works and get an explanation.
C# version:
- I have created a faster C# version of the parser.
- It requires PowerShell 7+. It can be imported using Add-Type -Path FileName.cs and called like [zett42.ServerDataParser]::Parse($text).
- It doesn't use RegEx at all. It is based on ReadOnlySpan<char> and uses only simple string operations. In my benchmarks it is about 10x faster than the C# version that uses RegEx and about 60x faster than the PS-only version.