How to parse more complex human-oriented text output into a machine-friendly format?

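This is a question about how to parse "unparseable" output into JSON, or something similarly easy to consume. It goes a little beyond the trivial cases, so I would like to know how you solve this kind of problem in principle, not just this specific example. But here is the example:

We have this command, which shows data about audio inputs: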

pacmd list-sink-inputs

It prints something like this:

2 sink input(s) available.
    index: 144
    driver: <protocol-native.c>
    flags: 
    state: RUNNING
    sink: 4 <alsa_output.pci-0000_05_00.0.analog-stereo>
    volume: front-left: 15728 /  24% / -37.19 dB,   front-right: 15728 /  24% / -37.19 dB
            balance 0.00
    muted: no
    current latency: 70.48 ms
    requested latency: 210.00 ms
    sample spec: float32le 2ch 44100Hz
    channel map: front-left,front-right
                 Stereo
    resample method: copy
    module: 13
    client: 245 <MPlayer>
    properties:
        media.name = "UNREAL! Tetris Theme on Violin and Guitar-TnDIRr9C83w.webm"
        application.name = "MPlayer"
        native-protocol.peer = "UNIX socket client"
        native-protocol.version = "32"
        application.process.id = "1543"
        application.process.user = "mmucha"
        application.process.host = "vbDesktop"
        application.process.binary = "mplayer"
        application.language = "C"
        window.x11.display = ":0"
        application.process.machine_id = "720184179caa46f0a3ce25156642f7a0"
        application.process.session_id = "2"
        module-stream-restore.id = "sink-input-by-application-name:MPlayer"
    index: 145
    driver: <protocol-native.c>
    flags: 
    state: RUNNING
    sink: 4 <alsa_output.pci-0000_05_00.0.analog-stereo>
    volume: front-left: 24903 /  38% / -25.21 dB,   front-right: 24903 /  38% / -25.21 dB
            balance 0.00
    muted: no
    current latency: 70.50 ms
    requested latency: 210.00 ms
    sample spec: float32le 2ch 48000Hz
    channel map: front-left,front-right
                 Stereo
    resample method: speex-float-1
    module: 13
    client: 251 <MPlayer>
    properties:
        media.name = "Trombone Shorty At Age 13 - 2nd Line-k9YUi3UhEPQ.webm"
        application.name = "MPlayer"
        native-protocol.peer = "UNIX socket client"
        native-protocol.version = "32"
        application.process.id = "2831"
        application.process.user = "mmucha"
        application.process.host = "vbDesktop"
        application.process.binary = "mplayer"
        application.language = "C"
        window.x11.display = ":0"
        application.process.machine_id = "720184179caa46f0a3ce25156642f7a0"
        application.process.session_id = "2"
        module-stream-restore.id = "sink-input-by-application-name:MPlayer"

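Very nice. But we don't want to show all of this to the user; we only want to show the index (the ID of the input), application.process.id, application.name and media.name in some reasonable format. It would be great to parse it into JSON somehow, but even if I preprocess the text, jq is beyond my abilities and quite complicated. I tried several approaches with jq, with and without regular expressions, but could not get any of them to work. And I guess we cannot rely on the order or the presence of all fields.

I was able to get the job "done", but it is messy and inefficient, and it only works if there are no semicolons in the media name or the application name. It is not an acceptable solution, but it is the only thing I was able to bring to an "end".

Bad solution: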

cat exampleOf2Inputs | 
grep -e "index: \|application.process.id = \|application.name = \|media.name = " | 
sed "s/^[ \t]*//;s/^\([^=]*\) = /\1: /" | 
tr "\n" ";" | 
sed "s/$/\n/;s/index:/\nindex:/g" | 
tail -n +2 | 
while read A; do 
index=$(echo $A|sed "s/^index: \([0-9]*\).*/\1/");
pid=$(echo $A|sed 's/^.*application\.process\.id: \"\([0-9]*\)\".*$/\1/'); 
appname=$(echo $A|sed 's/^.*application\.name: \"\([^;]*\)\".*$/\1/'); 
medianame=$(echo $A|sed 's/^.*media\.name: \"\([^;]*\)\".*$/\"\1\"/'); 

echo "pid=$pid index=$index appname=$appname medianame=$medianame"; 
done

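Roughly: I grep the interesting parts, replace the newlines with semicolons, split the result into one line per input, and then extract the data with several sed calls. Crazy.

The output here is: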

pid=1543 index=144 appname=MPlayer medianame="UNREAL! Tetris Theme on Violin and Guitar-TnDIRr9C83w.webm"
pid=2831 index=145 appname=MPlayer medianame="Trombone Shorty At Age 13 - 2nd Line-k9YUi3UhEPQ.webm"

This can easily be converted into any format, but since the question is about JSON:

[
  {
    "pid": 1543,
    "index": 144,
    "appname": "MPlayer",
    "medianame": "UNREAL! Tetris Theme on Violin and Guitar-TnDIRr9C83w.webm"
  },
  {
    "pid": 2831,
    "index": 145,
    "appname": "MPlayer",
    "medianame": "Trombone Shorty At Age 13 - 2nd Line-k9YUi3UhEPQ.webm"
  }
]

Now I would like to see how these things are done correctly.

I don't know about "correctly", but this is how I would do it:

pacmd list-sink-inputs | awk '
    BEGIN { print "[" }
    function print_record() {
        if (count++) {
            print "  {"
            printf "    %s,\n", print_number("pid")
            printf "    %s,\n", print_number("index")
            printf "    %s,\n", print_string("appname")
            printf "    %s\n",  print_string("medianame")
            print "  },"
        }
        delete record
    }
    function print_number(key) { return sprintf("\"%s\": %d", key, record[key]) }
    function print_string(key) { return sprintf("\"%s\": \"%s\"", key, record[key]) }
    function get_quoted_value() {
        if (match($0, /[^"]+"$/))
            return substr($0, RSTART, RLENGTH-1)
        else
            return "?"
    }
    $1 == "index:" { print_record(); record["index"] = $2 }
    $1 == "application.process.id" { record["pid"]       = get_quoted_value() }
    $1 == "application.name"       { record["appname"]   = get_quoted_value() }
    $1 == "media.name"             { record["medianame"] = get_quoted_value() }
    END { print_record(); print "]" }
' | 
  tac | awk '/},$/ && !seen++ {sub(/,$/,"")} 1' | tac

The tac|awk|tac line removes the trailing comma from the last JSON object in the list.

[
  {
    "pid": 1543,
    "index": 144,
    "appname": "MPlayer",
    "medianame": "UNREAL! Tetris Theme on Violin and Guitar-TnDIRr9C83w.webm"
  },
  {
    "pid": 2831,
    "index": 145,
    "appname": "MPlayer",
    "medianame": "Trombone Shorty At Age 13 - 2nd Line-k9YUi3UhEPQ.webm"
  }
]
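
The trailing-comma cleanup can also be avoided altogether by printing the separator before every record except the first. The following is only a sketch under some assumptions: the function name sink_inputs_to_json is made up for illustration, values are not JSON-escaped (a quote or backslash inside a media name would still produce invalid JSON), and a missing numeric field is printed as null.

```shell
#!/bin/sh
# Hypothetical helper: reads pacmd-style text on stdin, prints a JSON array.
# The comma is emitted BEFORE each record except the first one, so there is
# never a trailing comma to strip afterwards.
sink_inputs_to_json() {
    awk '
        # Text between the first and the last double quote on the line.
        function quoted_value(   v) {
            v = $0
            sub(/^[^"]*"/, "", v)
            sub(/"[^"]*$/, "", v)
            return v
        }
        # Missing numeric fields become null instead of 0.
        function num(v) { return (v == "") ? "null" : v + 0 }
        # Print the record collected so far, if any, then reset it.
        function emit() {
            if (!have) return
            printf "%s  {\"pid\": %s, \"index\": %s, \"appname\": \"%s\", \"medianame\": \"%s\"}", sep, num(rec_pid), num(rec_index), rec_app, rec_media
            sep = ",\n"
            have = 0
            rec_pid = rec_index = rec_app = rec_media = ""
        }
        BEGIN { print "[" }
        $1 == "index:"                 { emit(); have = 1; rec_index = $2 }
        $1 == "application.process.id" { rec_pid   = quoted_value() }
        $1 == "application.name"       { rec_app   = quoted_value() }
        $1 == "media.name"             { rec_media = quoted_value() }
        END { emit(); if (sep != "") printf "\n"; print "]" }
    '
}
```

It would be used as: pacmd list-sink-inputs | sink_inputs_to_json. Because the fields are matched by name, their order inside a record does not matter.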

You could pipe your output (the pid=... index=... lines) into:

sed -E '
  s/pid=([0-9]+) index=([0-9]+) appname=([^ ]+) medianame=(.*)/{"pid": \1, "index": \2, "appname": "\3", "medianame": \4},/
  1s/^/[/
  $s/,$/]/
' | jq .

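If the input is as reasonable as shown in the Q, the following jq-only approach should work.

Assuming an invocation such as:

jq -nR -f parse.jq input.txt

where parse.jq contains:
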
def parse:
  def interpret:
    if . == null then .
    elif startswith("\"") and endswith("\"")
    then  .[1:-1]
    else tonumber? // .
    end;
  (capture( "(?<key>[^\t:= ]*)(: | = )(?<value>.*)" ) // null)
  | if . then .value = (.value | interpret) else . end
;

# Construct one object for each "segment"  
def construct(s): 
  [ foreach (s, 0) as $kv (null;
      if $kv == 0 or $kv.index
      then .complete = .accumulator | .accumulator = $kv
      else .complete = null | .accumulator += $kv
      end;
      .complete // empty ) ]
;


construct(inputs | parse | select(.) | {(.key):.value})
| map( {pid: .["application.process.id"],
        index,
        appname: .["application.name"],
        medianame: .["media.name"]} )

With the example input, the output would be:

[
  {
    "pid": "1543",
    "index": 144,
    "appname": "MPlayer",
    "medianame": "UNREAL! Tetris Theme on Violin and Guitar-TnDIRr9C83w.webm"
  },
  {
    "pid": "2831",
    "index": 145,
    "appname": "MPlayer",
    "medianame": "Trombone Shorty At Age 13 - 2nd Line-k9YUi3UhEPQ.webm"
  }
]

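Brief explanation

parse parses one line. It assumes that the whitespace (blanks and tabs) before the key name on each line can be ignored.

construct is responsible for grouping the lines (presented as a stream of single-key key-value objects) that belong to a particular value of "index". It produces an array of objects, one for each value of "index".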
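
The same two steps (parse every line generically, then group by "index") can be sketched without jq. This is only an illustration under stated assumptions, not a translation of the jq program: the function name flatten_sink_inputs is hypothetical, keys may contain spaces (so "sample spec" stays one key, unlike in the jq regex), and continuation lines such as "balance 0.00" are skipped. Each recognised line becomes index<TAB>key<TAB>value, which is easy to filter afterwards.

```shell
#!/bin/sh
# Hypothetical helper: flattens the report into index<TAB>key<TAB>value lines.
flatten_sink_inputs() {
    awk '
        # "key: value" lines; an "index" key starts the next segment.
        match($0, /^[ \t]*[^:=]+: /) {
            key = substr($0, RSTART, RLENGTH - 2)
            sub(/^[ \t]+/, "", key)
            value = substr($0, RSTART + RLENGTH)
            if (key == "index") current = value
            print current "\t" key "\t" value
            next
        }
        # key = "value" lines of the properties block; the quotes are dropped.
        match($0, /^[ \t]*[^:=]+ = "/) {
            key = substr($0, RSTART, RLENGTH - 4)
            sub(/^[ \t]+/, "", key)
            value = substr($0, RSTART + RLENGTH)
            sub(/"$/, "", value)
            print current "\t" key "\t" value
        }
    '
}
```

Selecting fields is then a plain filter on the second column, for example: pacmd list-sink-inputs | flatten_sink_inputs | awk -F'\t' '$2 == "media.name"'. Nothing here depends on the order of the fields or on all of them being present.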