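How to parse more complex human-oriented text output into a machine-friendly format?
This is a question about how to parse "unparseable" output into JSON, or into something easily consumable as JSON. It is a little beyond trivial, so I'd like to know how you solve these things in principle, not just for this specific example. But here is an example:
We have this command, which shows data about audio inputs: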
pacmd list-sink-inputs
It prints something like this:
2 sink input(s) available.
index: 144
driver: <protocol-native.c>
flags:
state: RUNNING
sink: 4 <alsa_output.pci-0000_05_00.0.analog-stereo>
volume: front-left: 15728 / 24% / -37.19 dB, front-right: 15728 / 24% / -37.19 dB
balance 0.00
muted: no
current latency: 70.48 ms
requested latency: 210.00 ms
sample spec: float32le 2ch 44100Hz
channel map: front-left,front-right
Stereo
resample method: copy
module: 13
client: 245 <MPlayer>
properties:
media.name = "UNREAL! Tetris Theme on Violin and Guitar-TnDIRr9C83w.webm"
application.name = "MPlayer"
native-protocol.peer = "UNIX socket client"
native-protocol.version = "32"
application.process.id = "1543"
application.process.user = "mmucha"
application.process.host = "vbDesktop"
application.process.binary = "mplayer"
application.language = "C"
window.x11.display = ":0"
application.process.machine_id = "720184179caa46f0a3ce25156642f7a0"
application.process.session_id = "2"
module-stream-restore.id = "sink-input-by-application-name:MPlayer"
index: 145
driver: <protocol-native.c>
flags:
state: RUNNING
sink: 4 <alsa_output.pci-0000_05_00.0.analog-stereo>
volume: front-left: 24903 / 38% / -25.21 dB, front-right: 24903 / 38% / -25.21 dB
balance 0.00
muted: no
current latency: 70.50 ms
requested latency: 210.00 ms
sample spec: float32le 2ch 48000Hz
channel map: front-left,front-right
Stereo
resample method: speex-float-1
module: 13
client: 251 <MPlayer>
properties:
media.name = "Trombone Shorty At Age 13 - 2nd Line-k9YUi3UhEPQ.webm"
application.name = "MPlayer"
native-protocol.peer = "UNIX socket client"
native-protocol.version = "32"
application.process.id = "2831"
application.process.user = "mmucha"
application.process.host = "vbDesktop"
application.process.binary = "mplayer"
application.language = "C"
window.x11.display = ":0"
application.process.machine_id = "720184179caa46f0a3ce25156642f7a0"
application.process.session_id = "2"
module-stream-restore.id = "sink-input-by-application-name:MPlayer"
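Very nice. But we don't want to show all of this to the user; we only want to show index (the id of the input), application.process.id, application.name and media.name in some reasonable format. It would be great to parse it into JSON somehow, but even if I preprocess it, jq is way beyond my capabilities and quite complex. I tried several approaches with jq, with and without regular expressions, but I could not finish any of them. And I guess we cannot rely on the order or presence of all fields.
I was able to get the job "done", but it is messy, inefficient, and assumes there are no semicolons in the media name or the application name. It is not an acceptable solution, but it is the only thing I was able to bring to the "end".
Incorrect solution: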
cat exampleOf2Inputs |
grep -e "index: \|application.process.id = \|application.name = \|media.name = " |
sed "s/^[ \t]*//;s/^\([^=]*\) = /\1: /" |
tr "\n" ";" |
sed "s/$/\n/;s/index:/\nindex:/g" |
tail -n +2 |
while read A; do
index=$(echo $A|sed "s/^index: \([0-9]*\).*/\1/");
pid=$(echo $A|sed 's/^.*application\.process\.id: \"\([0-9]*\)\".*$/\1/');
appname=$(echo $A|sed 's/^.*application\.name: \"\([^;]*\)\".*$/\1/');
medianame=$(echo $A|sed 's/^.*media\.name: \"\([^;]*\)\".*$/\"\1\"/');
echo "pid=$pid index=$index appname=$appname medianame=$medianame";
done
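I grep the interesting parts, replace newlines with semicolons, split the result into one line per input, and then extract the data with several sed calls. Crazy.
The output here is: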
pid=1543 index=144 appname=MPlayer medianame="UNREAL! Tetris Theme on Violin and Guitar-TnDIRr9C83w.webm"
pid=2831 index=145 appname=MPlayer medianame="Trombone Shorty At Age 13 - 2nd Line-k9YUi3UhEPQ.webm"
This can easily be converted into any format, but the question is about JSON, so:
[
{
"pid": 1543,
"index": 144,
"appname": "MPlayer",
"medianame": "UNREAL! Tetris Theme on Violin and Guitar-TnDIRr9C83w.webm"
},
{
"pid": 2831,
"index": 145,
"appname": "MPlayer",
"medianame": "Trombone Shorty At Age 13 - 2nd Line-k9YUi3UhEPQ.webm"
}
]
Now I'd like to see how these things are done correctly.
I don't know about "correctly", but this is what I would do:
pacmd list-sink-inputs | awk '
BEGIN { print "[" }
function print_record() {
if (count++) {
print " {"
printf " %s,\n", print_number("pid")
printf " %s,\n", print_number("index")
printf " %s,\n", print_string("appname")
printf " %s\n", print_string("medianame")
print " },"
}
delete record
}
function print_number(key) { return sprintf("\"%s\": %d", key, record[key]) }
function print_string(key) { return sprintf("\"%s\": \"%s\"", key, record[key]) }
function get_quoted_value() {
if (match($0, /[^"]+"$/))
return substr($0, RSTART, RLENGTH-1)
else
return "?"
}
$1 == "index:" { print_record(); record["index"] = $2 }
$1 == "application.process.id" { record["pid"] = get_quoted_value() }
$1 == "application.name" { record["appname"] = get_quoted_value() }
$1 == "media.name" { record["medianame"] = get_quoted_value() }
END { print_record(); print "]" }
' |
tac | awk '/},$/ && !seen++ {sub(/,$/,"")} 1' | tac
where the tac|awk|tac line removes the trailing comma from the last JSON object in the list.
[
{
"pid": 1543,
"index": 144,
"appname": "MPlayer",
"medianame": "UNREAL! Tetris Theme on Violin and Guitar-TnDIRr9C83w.webm"
},
{
"pid": 2831,
"index": 145,
"appname": "MPlayer",
"medianame": "Trombone Shorty At Age 13 - 2nd Line-k9YUi3UhEPQ.webm"
}
]
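A variant that avoids the tac | awk | tac post-processing is to print the separator before each record instead of after it. The following is only a sketch under assumptions: it relies on POSIX awk, the names sink_inputs_to_json, json_escape, quoted_value and emit_record are made up for illustration, and the escaping covers only backslashes and double quotes (a missing pid or index is printed as 0).

```shell
# Hypothetical helper (not part of the answer above): reads
# `pacmd list-sink-inputs`-style text on stdin, writes a JSON array on stdout.
sink_inputs_to_json() {
  awk '
    # Escape backslashes and double quotes; other control characters are not handled.
    function json_escape(s,    out, i, c) {
      out = ""
      for (i = 1; i <= length(s); i++) {
        c = substr(s, i, 1)
        if (c == "\\" || c == "\"") out = out "\\" c
        else out = out c
      }
      return out
    }
    # Return the text between the first double quote after "=" and the last one.
    function quoted_value(line) {
      sub(/^[^=]*=[ \t]*"/, "", line)
      sub(/"[ \t]*$/, "", line)
      return line
    }
    # Print the pending record, preceded by "[" for the first one and "," otherwise.
    function emit_record() {
      if (!pending) return
      printf "%s", (n++ ? ",\n" : "[\n")
      printf "  {\"pid\": %d, \"index\": %d, \"appname\": \"%s\", \"medianame\": \"%s\"}",
        rec["pid"], rec["index"], json_escape(rec["appname"]), json_escape(rec["medianame"])
      split("", rec)   # portable way to clear the array
      pending = 0
    }
    $1 == "index:"                 { emit_record(); pending = 1; rec["index"] = $2 }
    $1 == "application.process.id" { rec["pid"] = quoted_value($0) }
    $1 == "application.name"       { rec["appname"] = quoted_value($0) }
    $1 == "media.name"             { rec["medianame"] = quoted_value($0) }
    END { emit_record(); print (n ? "\n]" : "[]") }
  '
}
```

It would be used as `pacmd list-sink-inputs | sink_inputs_to_json`. Because the comma is written before every record except the first, no record ever ends with a comma that has to be removed afterwards, and an input without any index line yields an empty array.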
You could pipe your output into:
sed -E '
s/pid=([0-9]+) index=([0-9]+) appname=([^ ]+) medianame=(.*)/{"pid": \1, "index": \2, "appname": "\3", "medianame": \4},/
1s/^/[/
$s/,$/]/
' | jq .
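If the input is as reasonable as shown in the question, the following jq-only approach should work.
Assuming an invocation along these lines: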
jq -nR -f parse.jq input.txt
def parse:
def interpret:
if . == null then .
elif startswith("\"") and endswith("\"")
then .[1:-1]
else tonumber? // .
end;
(capture( "(?<key>[^\t:= ]*)(: | = )(?<value>.*)" ) // null)
| if . then .value = (.value | interpret) else . end
;
# Construct one object for each "segment"
def construct(s):
[ foreach (s, 0) as $kv (null;
if $kv == 0 or $kv.index
then .complete = .accumulator | .accumulator = $kv
else .complete = null | .accumulator += $kv
end;
.complete // empty ) ]
;
construct(inputs | parse | select(.) | {(.key):.value})
| map( {pid: .["application.process.id"],
index,
appname: .["application.name"],
medianame: .["media.name"]} )
With the sample input, the output would be:
[
{
"pid": "1543",
"index": 144,
"appname": "MPlayer",
"medianame": "UNREAL! Tetris Theme on Violin and Guitar-TnDIRr9C83w.webm"
},
{
"pid": "2831",
"index": 145,
"appname": "MPlayer",
"medianame": "Trombone Shorty At Age 13 - 2nd Line-k9YUi3UhEPQ.webm"
}
]
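Brief explanation
parse
parses one line. It assumes that the whitespace (blanks and tabs) before the key name on each line can be ignored.
construct
is responsible for grouping the lines (presented as a stream of single-key key-value objects) that belong to a particular value of "index". It produces an array of objects, one for each value of "index".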
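The same two steps (turn each line into a key/value pair, then start a new group whenever an "index" key appears) can also be sketched without jq. This is a rough illustration, assuming POSIX awk; the function name sink_inputs_to_tsv is hypothetical, and unlike the jq regex above it takes everything before the first ": " or " = " as the key. It emits one tab-separated row of index, key and value per recognised line, which is easy to filter further.

```shell
# Hypothetical helper: flattens `pacmd list-sink-inputs`-style text on stdin
# into tab-separated rows: <index> <key> <value>.
sink_inputs_to_tsv() {
  awk '
    {
      line = $0
      sub(/^[ \t]+/, "", line)           # ignore indentation before the key

      # Locate both separators; the one that occurs first splits key from value.
      c = index(line, ": ")
      e = index(line, " = ")
      if (c == 0 && e == 0) next         # not a key/value line, skip it

      if (e == 0 || (c != 0 && c < e)) {
        key = substr(line, 1, c - 1); val = substr(line, c + 2)
      } else {
        key = substr(line, 1, e - 1); val = substr(line, e + 3)
      }

      # Strip one pair of surrounding double quotes, if present.
      if (val ~ /^".*"$/) val = substr(val, 2, length(val) - 2)

      if (key == "index") current = val  # an index line starts a new group
      if (current != "") printf "%s\t%s\t%s\n", current, key, val
    }
  '
}
```

For example, `pacmd list-sink-inputs | sink_inputs_to_tsv | awk -F'\t' '$2 == "media.name"'` would list the media name of every input next to its index, without depending on the order or presence of the other fields.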