Extract structured data from text files (awk?): missing fields must get a default value

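(I am on macOS.)

I have 70k text files in subfolders, from which I want to extract some data recursively and then, if possible, write the output to a tab-separated file for later spreadsheet processing. The files come from my wiki (I use PmWiki, which stores its data in text files) and, when complete, are formatted like this (unneeded data removed for readability):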

version=
agent=
author=
charset=
csum=
ctime=1041379201
description=
host=
name=Name.12
rev=3
targets=Target.1,OtherTarget.23,Target.90
text=
time=
title=My title
author:
csum:
diff:
host:
author:
csum:
diff:

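I want to extract the =-separated data for the fields named ctime, name, rev, targets and title (5 fields).

My main problem is how to get the data (keys ctime=, rev=, targets=, name=, title=) and how to use a default value when some of them are missing.

I think each target key must be tested for existence; if it is missing, it is created with a default value; then the wanted field values are extracted and, finally, the data is tabulated.

The expected output is tab-separated; missing data should be given a name that is easy to catch later. I.e., for the complete file given in the example (with tabs instead of spaces), the output would be something like (ctime, rev, name, title, targets):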

1041379201 3 Name.12 my title Target.1,OtherTarget.23,Target.90

And, for incomplete files (the missing field is rev on line 1; rev and title on line 2):

1041379201 XXX Name.12 my title Target.1,OtherTarget.23,Target.90
1041379201 XXX Name.12 XXX Target.1,OtherTarget.23,Target.90

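The final goal is to be able to extract the data once a month and then have a text file that is easy to use in a spreadsheet, updated monthly.

My worst attempt looks like this (but it does not work at all; the if/else conditions are missing):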

awk 'BEGIN { FS = "=" ;} /^ctime=/ {
                print 
                next
                }
/^rev=/ {
                print 
                next}
/^name=/ {
                print 
                next}
/^title=/ {
                print 
                next}
/^targets=/ {
                print 
                next}'

Here is a raw PmWiki file (in that case I still want to extract ctime, name, rev, targets and title, with default values for the missing fields, ctime and title):

version=pmwiki-2.2.64 ordered=1 urlencoded=1
author=simon
charset=UTF-8
csum=add summary
name=Main.HomePage
rev=203
targets=PmWiki.DocumentationIndex,PmWiki.InitialSetupTasks,PmWiki.BasicEditing,Main.WikiSandbox
text=(:Summary:The default home page for the PmWiki distribution:)%0aWelcome to PmWiki!%0a%0aA local copy of PmWiki's%0adocumentation has been installed along with the software,%0aand is available via the [[PmWiki/documentation index]].  %0a%0aTo continue setting up PmWiki, see [[PmWiki/initial setup tasks]].%0a%0aThe [[PmWiki/basic editing]] page describes how to create pages%0ain PmWiki.  You can practice editing in the [[wiki sandbox]].%0a%0aMore information about PmWiki is available from [[http://www.pmwiki.org]].%0a
time=1400472661

Updating my question.

The way I posted my question may make it look more complicated than it is. From this, repeated across 70k text files:

word1=line1
word2=line2
word3=line3
...

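I want to collect line1, line3, lineX from each file (with a command that targets word1, word2, wordX), and get a default value wherever word1=line1, word2=line2 or wordX=lineX is not present at all.

Finally, I found that Rick Smith's answer to "Retrieve default value with grep -e?" is very close to what I need.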

I just noticed you said you only want to print the values of specific tags, which makes things easier. Using GNU awk for ENDFILE and gensub():

$ cat tst.awk
BEGIN {
    OFS="\t"
    numTags = split("ctime rev targets name title",tags)

    for (tagNr=1; tagNr<=numTags; tagNr++) {
        tag = tags[tagNr]
        printf "%s%s", tag, (tagNr<numTags ? OFS : ORS)
    }
}

match($0,/^([[:alnum:]_]+)[=:](.*)/,a) {
    tag = a[1]
    val = gensub(" ?" OFS " ?"," ","g",a[2])
    tag2val[tag] = val
}

ENDFILE {
    for (tagNr=1; tagNr<=numTags; tagNr++) {
        tag = tags[tagNr]
        val = ( tag in tag2val ? tag2val[tag] : "_ABSENT_" )
        val = ( val == "" ? "_NULL_" : val )
        printf "%s%s", val, (tagNr<numTags ? OFS : ORS)
    }
    delete tag2val
}

$ awk -f tst.awk file
ctime   rev     targets name    title
1041379201      3       Target.1,OtherTarget.23,Target.90       Name.12 My title

$ awk -f tst.awk file | column -s$'\t' -t
ctime       rev  targets                            name     title
1041379201  3    Target.1,OtherTarget.23,Target.90  Name.12  My title
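
Since the question mentions macOS, where the stock awk is not GNU awk, here is a rough sketch of the same idea without ENDFILE, gensub() or the third argument to match(). It is an adaptation under assumptions, not part of the script above: the function name extract_tags is made up for illustration, [[:alnum:]_] is spelled out as [A-Za-z0-9_], a change of file is detected with FNR == 1, and an empty input file therefore produces no row at all.

```shell
# extract_tags is a hypothetical helper name; it prints a header plus one TSV row per file.
extract_tags() {
    awk '
    function flush_row(   i, tag, val) {
        for (i = 1; i <= numTags; i++) {
            tag = tags[i]
            val = (tag in tag2val) ? tag2val[tag] : "_ABSENT_"
            if (val == "") val = "_NULL_"
            printf "%s%s", val, (i < numTags ? OFS : ORS)
        }
        split("", tag2val)                  # portable way to empty an array
    }
    BEGIN {
        OFS = "\t"
        numTags = split("ctime rev targets name title", tags, " ")
        for (i = 1; i <= numTags; i++)      # header row
            printf "%s%s", tags[i], (i < numTags ? OFS : ORS)
    }
    FNR == 1 && NR > 1 { flush_row() }      # a new file started: emit the previous one
    match($0, /^[A-Za-z0-9_]+[=:]/) {
        tag = substr($0, 1, RLENGTH - 1)    # text before the separator
        val = substr($0, RLENGTH + 1)       # everything after the separator
        gsub(/ ?\t ?/, " ", val)            # keep tabs out of the TSV values
        tag2val[tag] = val
    }
    END { if (NR > 0) flush_row() }         # emit the last file
    ' "$@"
}

# Usage: extract_tags file1 file2 etc.
```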

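Original answer:

If the tags are unique within each input file, it sounds like this might be what you are trying to do; it requires GNU awk for several extensions: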

$ cat tst.awk
BEGIN { OFS="\t" }
match($0,/^([[:alnum:]_]+)[=:](.*)/,a) {
    tag = a[1]
    val = gensub(" ?" OFS " ?"," ","g",a[2])

    if ( !seen[tag]++ ) {
        tags[++numTags] = tag
    }

    key2val[ARGIND,tag] = val
}
END {
    for (tagNr=1; tagNr<=numTags; tagNr++) {
        tag = tags[tagNr]
        printf "%s%s", tag, (tagNr<numTags ? OFS : ORS)
    }

    for ( fileNr=1; fileNr<=ARGIND; fileNr++) {
        for (tagNr=1; tagNr<=numTags; tagNr++) {
            tag = tags[tagNr]
            key = fileNr SUBSEP tag
            val = ( key in key2val ? key2val[key] : "_ABSENT_" )
            val = ( val == "" ? "_NULL_" : val )
            printf "%s%s", val, (tagNr<numTags ? OFS : ORS)
        }
    }
}

$ awk -f tst.awk file
version agent   author  charset csum    ctime   description     host    name    rev     targets text    time    title   diff
_NULL_  _NULL_  _NULL_  _NULL_  _NULL_  1041379201      _NULL_  _NULL_  Name.12 3       Target.1,OtherTarget.23,Target.90       _NULL_   _NULL_  My title        _NULL_

To see the columns visually aligned:

$ awk -f tst.awk file | column -s$'\t' -t
version  agent   author  charset  csum    ctime       description  host    name     rev  targets                            text    time    title     diff
_NULL_   _NULL_  _NULL_  _NULL_   _NULL_  1041379201  _NULL_       _NULL_  Name.12  3    Target.1,OtherTarget.23,Target.90  _NULL_  _NULL_  My title  _NULL_

Just run it once on all of your files, like this:

awk -f tst.awk file1 file2 etc.

It will work out all of the tags across all of the files and then print a TSV containing the values of all of those tags from all of those files.
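
With roughly 70k files in subfolders, as in the question, naming every file on one command line may run into the argument-length limit. One possible workaround, sketched below, is to let find pass the files to awk in batches with -exec ... {} +. Each batch is a separate awk process, so this only suits a script whose list of tags is fixed in advance and that prints no header; it would not work for a script that discovers the tags in END. The function name extract_tree is hypothetical, the XXX placeholder and the column order are taken from the question, and treating an empty value like a missing one is my assumption.

```shell
# extract_tree is a hypothetical helper name; it prints one TSV row per file found under $1.
extract_tree() {
    find "$1" -type f -exec awk '
        function flush_row(   i, val) {
            for (i = 1; i <= n; i++) {
                # assumption: an empty value is treated like a missing one
                val = ((want[i] in seen) && seen[want[i]] != "") ? seen[want[i]] : "XXX"
                printf "%s%s", val, (i < n ? "\t" : "\n")
            }
            split("", seen)                  # portable way to empty an array
        }
        BEGIN { n = split("ctime rev name title targets", want, " ") }
        FNR == 1 && NR > 1 { flush_row() }   # previous file is complete
        {
            pos = index($0, "=")             # split on the first "=" only
            if (pos > 1) seen[substr($0, 1, pos - 1)] = substr($0, pos + 1)
        }
        END { if (NR > 0) flush_row() }      # last file of this batch
    ' {} +
}

# Usage (both paths are placeholders): extract_tree path/to/wiki/files > pages.tsv
```

The rows come out in whatever order find visits the files, so sort the result afterwards if the order matters.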

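Assumptions:

  • an input file has at least one line
  • field=value entries do not span multiple lines (i.e., neither field nor value includes embedded linefeeds/carriage returns)
  • neither field nor value contains the = character (i.e., = appears only once per input line)
  • OP can create a new file containing the list of desired fields and their default values [this removes the need to hardcode the field names, their order and their default values]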
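
If the third assumption turns out not to hold (the version= line in the raw PmWiki sample has several = characters, for instance), one way to relax it is to split each line on the first = only, using index() and substr() instead of -F'='. This is a minimal sketch of that one step, not a replacement for the full script; the function name split_first_eq is made up for illustration.

```shell
# split_first_eq is a hypothetical helper name; it prints "key<TAB>value" per input line.
split_first_eq() {
    awk '
    {
        pos = index($0, "=")            # position of the first "=" (0 if none)
        if (pos < 2) next               # no "=" on the line, or an empty key
        key = substr($0, 1, pos - 1)    # text before the first "="
        val = substr($0, pos + 1)       # everything after it, further "=" included
        printf "%s\t%s\n", key, val
    }' "$@"
}

# Usage: split_first_eq 2.txt
```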

Sample input files:

$ cat 1.txt
version=
agent=
author=
charset=
csum=
ctime=1041379201
description=
host=
name=Name.12
rev=3
targets=Target.1,OtherTarget.23,Target.90
text=
time=
title=My title
author:
csum:
diff:
host:
author:
csum:
diff:

$ cat 2.txt
version=pmwiki-2.2.64 ordered=1 urlencoded=1
author=simon
charset=UTF-8
csum=add summary
name=Main.HomePage
rev=203
targets=PmWiki.DocumentationIndex,PmWiki.InitialSetupTasks,PmWiki.BasicEditing,Main.WikiSandbox
text=(:Summary:The default home page for the PmWiki distribution:)%0aWelcome to PmWiki!%0a%0aA local copy of PmWiki's%0adocumentation has been installed along with the software,%0aand is available via the [[PmWiki/documentation index]].  %0a%0aTo continue setting up PmWiki, see [[PmWiki/initial setup tasks]].%0a%0aThe [[PmWiki/basic editing]] page describes how to create pages%0ain PmWiki.  You can practice editing in the [[wiki sandbox]].%0a%0aMore information about PmWiki is available from [[http://www.pmwiki.org]].%0a
time=1400472661

$ cat 3.txt                # NOTE: no matches with fields in defaults.txt
other=abc
line=def

Assuming OP can create a file containing the desired field names and default values, e.g.:

$ cat defaults.txt
ctime=CCCC
name=NNNN
rev=REV
targets=NO_TARGETS
title='BLANK TITLE'

NOTE: the order of the fields in the final output is the same as the order of the fields in defaults.txt.

One awk idea:

awk -F'=' '

function print_line() {
    if ( printme ) {                     # skip the first call to this function
       pfx=""
       for ( i=1; i<=ordno; i++ ) {      # loop through our list of desired fields ...
           # ... printing the saved value if the file had one, else the default
           printf "%s%s", pfx, ( order[i] in fields ? fields[order[i]] : defaults[order[i]] )
           pfx=OFS
       }
       print ""                          # terminate line
    }
    delete fields                        # reset our fields[] array
    printme=1                            # enable printing of fields[] contents on next call
}

BEGIN          { OFS="\t"                # output field delimiter
                 printme=0               # disable printing of fields[] on first function call
               }

FNR==NR        {                         # process 1st file, ie, our desired fields and their associated default values
                 order[++ordno]=$1       # save order of fields
                 defaults[$1]=$2         # save default values
                 next
               }

FNR==1         { print_line()            # upon seeing a new file flush the contents of fields[] to stdout
                 print "#### "FILENAME   # remove this line once OP validates output
               }

$1 in defaults { fields[$1]=$2 }         # if field #1 is in our defaults[] array then save field #2 in our fields[] array

END            { print_line() }          # flush last file/fields[] to stdout

' defaults.txt 1.txt 2.txt 3.txt

NOTE: I do not have access to a MacOS/awk installation, so OP will need to determine whether this works in their environment.

This generates:

#### 1.txt
1041379201      Name.12 3       Target.1,OtherTarget.23,Target.90       My title
#### 2.txt
CCCC    Main.HomePage   203     PmWiki.DocumentationIndex,PmWiki.InitialSetupTasks,PmWiki.BasicEditing,Main.WikiSandbox 'BLANK TITLE'
#### 3.txt
CCCC    NNNN    REV     NO_TARGETS      'BLANK TITLE'

Without the print "#### "FILENAME line:

1041379201      Name.12 3       Target.1,OtherTarget.23,Target.90       My title
CCCC    Main.HomePage   203     PmWiki.DocumentationIndex,PmWiki.InitialSetupTasks,PmWiki.BasicEditing,Main.WikiSandbox 'BLANK TITLE'
CCCC    NNNN    REV     NO_TARGETS      'BLANK TITLE'