(g)awk next file on partially blank line

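The problem

I just need to merge a whole bunch of files and keep the header (line 1) from the first file only.

The data

Here are the last three lines of three of these files (plus line 1, the header):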

"START_DATE","END_DATE","UNITS","COST","COST_CURRENCY","AMOUNT"
"20170101","20170131","1","5.49","EUR","5.49"
"20170101","20170131","1","4.27","EUR","4.27"
"","","","","9.76",""

"START_DATE","END_DATE","UNITS","COST","COST_CURRENCY","AMOUNT"
"20170201","20170228","1","5.49","EUR","5.49"
"20170201","20170228","1","4.88","EUR","4.88"
"20170201","20170228","1","0.61","EUR","0.61"
"20170201","20170228","1","0.61","EUR","0.61"
"","","","","11.59",""

"START_DATE","END_DATE","UNITS","COST","COST_CURRENCY","AMOUNT"
"20170301","20170331","1","4.88","EUR","4.88"
"20170301","20170331","1","4.27","EUR","4.27"
"","","","","9.15",""

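The problem (continued)

As you can see, the last line has a number in column 5 (it is the column total). Of course, I don't want that last line. But it is (obviously) on a different line number in each file.

(G)awk is obviously the solution, but I don't know (g)awk.

What I've tried

I've tried several combinations, but the one whose failure surprised me most is: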

gawk '
  { if (!$1) nextfile }
  NR == 1 {$0 = "Filename" "StartDate" OFS $0; print}
  FNR > 1 {$0 = FILENAME StartDate OFS $0; print}
' OFS=',' */*.csv > ../path/file.csv

Expected output (as requested)

"START_DATE","END_DATE","UNITS","COST","COST_CURRENCY","AMOUNT"
"20170101","20170131","1","5.49","EUR","5.49"
"20170101","20170131","1","4.27","EUR","4.27"
"20170201","20170228","1","5.49","EUR","5.49"
"20170201","20170228","1","4.88","EUR","4.88"
"20170201","20170228","1","0.61","EUR","0.61"
"20170201","20170228","1","0.61","EUR","0.61"
"20170301","20170331","1","4.88","EUR","4.88"
"20170301","20170331","1","4.27","EUR","4.27"

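Of course, I have tried searching Google and SO. Most of the answers I found require more awk knowledge than I have to understand them. (I'm not a data wrangler, but I have a data-wrangling task.)

Thanks for your help!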

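Something like the following should do the trick:

 awk -F"," 'NR==1{header=$0; print $0} $0!=header && $1!="\"\""{print $0}' */*.csv > ../path/file.csv

Here awk will:

  1. Split each record on commas: -F","
  2. If this is the first record awk encounters, set the variable header to the entire line and then print it: NR==1{header=$0; print $0}
  3. If the current line is not the header and its first field is not empty (an empty first field marks a "total" line), print the line: $0!=header && $1!="\"\""{print $0}. Because splitting on commas leaves the double quotes in each field, an "empty" first field is the two-character string "", which is why the comparison is against "\"\"" rather than "".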
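To try the filter without touching real data, here is a minimal sketch. The temporary directory and the file names 01.csv and 03.csv are made up for illustration; the rows are copied from the question. Note that with -F"," the fields keep their double quotes, so the empty first field of a total row is the two-character string "" and the sketch compares against "\"\"".

```shell
# Hypothetical sandbox: two small CSV files in a throwaway directory.
tmp=$(mktemp -d)
printf '%s\n' \
  '"START_DATE","END_DATE","UNITS","COST","COST_CURRENCY","AMOUNT"' \
  '"20170101","20170131","1","5.49","EUR","5.49"' \
  '"20170101","20170131","1","4.27","EUR","4.27"' \
  '"","","","","9.76",""' > "$tmp/01.csv"
printf '%s\n' \
  '"START_DATE","END_DATE","UNITS","COST","COST_CURRENCY","AMOUNT"' \
  '"20170301","20170331","1","4.88","EUR","4.88"' \
  '"20170301","20170331","1","4.27","EUR","4.27"' \
  '"","","","","9.15",""' > "$tmp/03.csv"

out=$(awk -F"," '
  # First record overall: remember the header and print it.
  NR == 1 { header = $0; print $0 }
  # Later records: skip repeated headers and total rows (first field is "").
  $0 != header && $1 != "\"\"" { print $0 }
' "$tmp"/*.csv)

printf '%s\n' "$out"
rm -r "$tmp"
```

This should print the header once, followed by the four data rows, with neither total row present.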

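As I mentioned in the comments below, if the first field of your records always starts with an 8-digit date, you can simplify (this is less generic than the code above):

 awk -F"," 'NR == 1 || $1 ~ /"[0-9]{8}"/ {print $0}' */*.csv > outfile.csv

Essentially this says: if this is the first record to be processed, print it (it's the header), or (||) if the first field is an 8-digit number wrapped in double quotes, print it.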
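Some older awk implementations do not support the {8} interval syntax in regular expressions. As a hedged alternative (this is a swapped-in test, not the one from the answer), the same condition can be expressed as "a quoted run of digits whose total length is 10". The sketch below feeds three records from the question on stdin rather than reading files:

```shell
out=$(printf '%s\n' \
  '"START_DATE","END_DATE","UNITS","COST","COST_CURRENCY","AMOUNT"' \
  '"20170301","20170331","1","4.88","EUR","4.88"' \
  '"","","","","9.15",""' |
awk -F"," '
  # The very first record is the header: print it and move on.
  NR == 1 { print; next }
  # A quoted run of digits with length 10 means exactly 8 digits plus 2 quotes.
  $1 ~ /^"[0-9]+"$/ && length($1) == 10 { print }
')

printf '%s\n' "$out"
```

Only the header and the 20170301 row should come out; the total row fails the digit test because its first field contains no digits.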

Another awk approach:

awk -F, '
        NR == 1 {
                header = $0
                print
                next
        }
        FNR > 1 && $1 != "\"\""
' *.csv

This should do it:

awk 'NR==1; FNR==1{next} FNR>2{print p} {p=$0}' file{1..3}

It prints the header first, then skips the other header lines and the last line of each file.
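The one-liner works by printing each line one step late, so the last line of every file is remembered but never printed. Spelled out with comments, and run against two made-up files (file1 and file2 in a temporary directory, with rows copied from the question), a sketch looks like this:

```shell
# Hypothetical scratch directory holding two input files.
tmp=$(mktemp -d)
printf '%s\n' \
  '"START_DATE","END_DATE","UNITS","COST","COST_CURRENCY","AMOUNT"' \
  '"20170101","20170131","1","5.49","EUR","5.49"' \
  '"20170101","20170131","1","4.27","EUR","4.27"' \
  '"","","","","9.76",""' > "$tmp/file1"
printf '%s\n' \
  '"START_DATE","END_DATE","UNITS","COST","COST_CURRENCY","AMOUNT"' \
  '"20170301","20170331","1","4.88","EUR","4.88"' \
  '"20170301","20170331","1","4.27","EUR","4.27"' \
  '"","","","","9.15",""' > "$tmp/file2"

out=$(awk '
  # First line overall: a pattern with no action prints the record (the header).
  NR == 1
  # First line of every file is a header: nothing more to do with it.
  FNR == 1 { next }
  # From the third line of a file on, print the line saved one step earlier.
  FNR > 2 { print p }
  # Save the current line; the last line of each file is saved but never printed.
  { p = $0 }
' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$out"
rm -r "$tmp"
```

Unlike the field-based answers, this variant never inspects the content of the total row. It assumes the total is always the very last line of each file, so a trailing blank line would cause the total to be printed and the blank line to be dropped instead.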