(g)awk next file on partially blank line
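Problem
I just need to merge a whole bunch of files, keeping the header (line 1) from the first file only.
Data
Here are three of these files, cut down to their last few lines (line 1 of each is the header):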
"START_DATE","END_DATE","UNITS","COST","COST_CURRENCY","AMOUNT"
"20170101","20170131","1","5.49","EUR","5.49"
"20170101","20170131","1","4.27","EUR","4.27"
"","","","","9.76",""
"START_DATE","END_DATE","UNITS","COST","COST_CURRENCY","AMOUNT"
"20170201","20170228","1","5.49","EUR","5.49"
"20170201","20170228","1","4.88","EUR","4.88"
"20170201","20170228","1","0.61","EUR","0.61"
"20170201","20170228","1","0.61","EUR","0.61"
"","","","","11.59",""
"START_DATE","END_DATE","UNITS","COST","COST_CURRENCY","AMOUNT"
"20170301","20170331","1","4.88","EUR","4.88"
"20170301","20170331","1","4.27","EUR","4.27"
"","","","","9.15",""
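Problem (continued)
As you can see, the last line has a number in column 5 (it is the column total). Of course I don't want that last line, but it (obviously) sits on a different line number in each file.
(G)awk is clearly the solution, but I don't know (g)awk.
What I've tried
I've tried many combinations, but I think the one that surprised me most by not working was: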
gawk '
{ if (! $1) nextfile }
NR == 1 {$0 = "Filename" "StartDate" OFS $0; print}
FNR > 1 {$0 = FILENAME StartDate OFS $0; print}
' OFS=',' */*.csv > ../path/file.csv
Expected output (as requested)
"START_DATE","END_DATE","UNITS","COST","COST_CURRENCY","AMOUNT"
"20170101","20170131","1","5.49","EUR","5.49"
"20170101","20170131","1","4.27","EUR","4.27"
"20170201","20170228","1","5.49","EUR","5.49"
"20170201","20170228","1","4.88","EUR","4.88"
"20170201","20170228","1","0.61","EUR","0.61"
"20170201","20170228","1","0.61","EUR","0.61"
"20170301","20170331","1","4.88","EUR","4.88"
"20170301","20170331","1","4.27","EUR","4.27"
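Of course I've tried searching Google and SO. Most of the answers I found require more awk knowledge than I have to understand them. (I'm not a data administrator, but I do have a data-wrangling task.)
Thanks for your help!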
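Something like the following should do the trick:
awk -F"," 'NR==1{header=$0; print $0} $0!=header && $1!="\"\""{print $0}' */*.csv > ../path/file.csv
Here awk will:
- Split records on commas
-F","
- If this is the first record awk encounters, set the variable header to the full contents of that line, then print the header
NR==1{header=$0; print $0}
- If the current line is not the header and the first field is not an empty quoted value (which marks a "total" line), print the line. Because the fields keep their quotes after splitting on commas, an empty first field is the two-character string "", hence the escaped quotes in the comparison
$0!=header && $1!="\"\""{print $0}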
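To check this behaviour without touching real data, here is a minimal sketch that rebuilds two of the sample files from the question in a scratch directory and runs the command over them. The directory names a/ and b/, the file names jan.csv and mar.csv, and the variable merged are assumptions made for illustration only.

```shell
# Hypothetical scratch layout: a/ and b/ stand in for the real subdirectories.
tmp=$(mktemp -d)
mkdir -p "$tmp/a" "$tmp/b"

cat > "$tmp/a/jan.csv" <<'EOF'
"START_DATE","END_DATE","UNITS","COST","COST_CURRENCY","AMOUNT"
"20170101","20170131","1","5.49","EUR","5.49"
"20170101","20170131","1","4.27","EUR","4.27"
"","","","","9.76",""
EOF

cat > "$tmp/b/mar.csv" <<'EOF'
"START_DATE","END_DATE","UNITS","COST","COST_CURRENCY","AMOUNT"
"20170301","20170331","1","4.88","EUR","4.88"
"20170301","20170331","1","4.27","EUR","4.27"
"","","","","9.15",""
EOF

# Keep the first header, drop repeated headers, and drop lines whose
# first field is the two-character string "" (the total lines).
merged=$(cd "$tmp" && awk -F"," '
NR == 1 { header = $0; print $0 }
$0 != header && $1 != "\"\"" { print $0 }
' */*.csv)

printf '%s\n' "$merged"
rm -rf "$tmp"
```

The output should contain the header once, followed by the four data rows, with neither total line.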
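As I mentioned in the comments below, if the first field of your records always holds an 8-digit date, you can simplify (this is less general than the code above):
awk -F"," 'NR == 1 || $1 ~ /"[0-9]{8}"/ {print $0}' */*.csv > outfile.csv
In essence this says: if this is the first record being processed, print it (it is the header), or (||) if the first field is an 8-digit number enclosed in double quotes, print it.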
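Not every awk build accepts the {8} repetition syntax (some older gawk and mawk versions do not, as far as I know), so here is a hedged, more portable sketch of the same date test that spells out the eight digits and anchors the match to the whole field. The scratch directory, the file names feb.csv and mar.csv, and the variable dates_only are made up for the example, and feb.csv is shortened to two data rows.

```shell
# Hypothetical scratch layout with one subdirectory holding two sample files.
tmp=$(mktemp -d)
mkdir -p "$tmp/a"

cat > "$tmp/a/feb.csv" <<'EOF'
"START_DATE","END_DATE","UNITS","COST","COST_CURRENCY","AMOUNT"
"20170201","20170228","1","5.49","EUR","5.49"
"20170201","20170228","1","4.88","EUR","4.88"
"","","","","11.59",""
EOF

cat > "$tmp/a/mar.csv" <<'EOF'
"START_DATE","END_DATE","UNITS","COST","COST_CURRENCY","AMOUNT"
"20170301","20170331","1","4.88","EUR","4.88"
"20170301","20170331","1","4.27","EUR","4.27"
"","","","","9.15",""
EOF

# Print the very first record (the header), plus every record whose first
# field is exactly eight digits wrapped in double quotes.
dates_only=$(cd "$tmp" && awk -F"," '
NR == 1 || $1 ~ /^"[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]"$/ { print $0 }
' */*.csv)

printf '%s\n' "$dates_only"
rm -rf "$tmp"
```

Repeated headers and total lines fall out naturally here, because neither has a quoted 8-digit first field.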
Another awk approach:
awk -F, '
NR == 1 {
    header = $0
    print
    next
}
FNR > 1 && $1 != "\"\""
' *.csv
This should do it:
awk 'NR==1; FNR==1{next} FNR>2{print p} {p=$0}' file{1..3}
It prints the first header, then skips the other header lines and the last line of each file.
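The one-liner works by printing each line one step late, so the final line of every file is never printed. Below is a commented sketch of the same logic run on two sample files. The names file1 and file2, the scratch directory, and the variable result are assumptions for the example. Note that this variant drops the last line of each file whatever it contains, so it only fits if every file really ends with a total line.

```shell
# Hypothetical scratch directory holding two of the sample files.
tmp=$(mktemp -d)

cat > "$tmp/file1" <<'EOF'
"START_DATE","END_DATE","UNITS","COST","COST_CURRENCY","AMOUNT"
"20170101","20170131","1","5.49","EUR","5.49"
"20170101","20170131","1","4.27","EUR","4.27"
"","","","","9.76",""
EOF

cat > "$tmp/file2" <<'EOF'
"START_DATE","END_DATE","UNITS","COST","COST_CURRENCY","AMOUNT"
"20170301","20170331","1","4.88","EUR","4.88"
"20170301","20170331","1","4.27","EUR","4.27"
"","","","","9.15",""
EOF

result=$(awk '
NR == 1               # very first line overall: print it (the header)
FNR == 1 { next }     # first line of any file: do nothing more with it
FNR > 2  { print p }  # from line 3 of a file on: print the previous line
{ p = $0 }            # remember the current line for the next step
' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$result"
rm -rf "$tmp"
```

When a new file starts, p still holds the previous file's total line, but it is overwritten at line 2 before anything is printed, so the total never reaches the output.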