next command in gawk not producing expected result
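I'm trying to skip the entire first section of a bunch of tab-delimited text files. (I converted the sample data to comma-delimited.) I can't seem to figure out why this isn't working:
Code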
gawk '
/[^Country Of Sale]/ {next}
/^Cloud Total/ {nextfile}
FNR > 1 {$0 = FILENAME OFS $0; print}
' OFS='\t' /path/to/files/*.txt > path/to/new_file.txt
Data
"Start Date","End Date","UPC" "4/2/17","5/6/17","SKIP THIS LINE"
"4/2/17","5/6/17","SKIP THIS LINE" "4/2/17","5/6/17","SKIP THIS LINE"
"4/2/17","5/6/17","SKIP THIS LINE" "4/2/17","5/6/17","SKIP THIS LINE"
"Row Count","447","SKIP THIS LINE"
"Country Of Sale","Total","Total Units1","Total Units2","Total C_F","SPCU","PCUT","CPS","USPS","Total Share","EffSUBS","ActSUBS"
"AU","0","139851331","139851331","195833.36","0.001400297","1170","1.36","","1.36","91704.63","99430"
"Cloud Total","1.36" "Sales Total","243.18" "Total Amount","244.54"
Expected output
"Country Of Sale","Total","Total Units1","Total Units2","Total C_F","SPCU","PCUT","CPS","USPS","Total Share","EffSUBS","ActSUBS"
"AU","0","139851331","139851331","195833.36","0.001400297","1170","1.36","","1.36","91704.63","99430"
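Also, I'd like to use the "Country Of Sale" line as the header for all of the files. But NR & FNR count from the top of the input. Given that "Country Of Sale" appears on a different line number in each file, how can I do that?
Thanks for your help!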
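[...] is a bracket expression, which contains a list, set, or range of characters. It does not contain a string or the negation of a string. So
[^Country Of Sale]
is equivalent to [^ COSaeflnortuy] (the set includes a space),
when you probably meant:
!/Country Of Sale/
That still isn't what you really need, though. Try this: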
gawk '
BEGIN { FS=OFS="\t" }
/Country Of Sale/ { f=1 }
/Cloud Total/ { f=0; nextfile }
f { print FILENAME, $0 }
' RAW/iTunes/iTunesMatch/*.txt > munched/iTunesMatch_TEST.txt
Look:
$ cat file
"Start Date","End Date","UPC" "4/2/17","5/6/17","SKIP THIS LINE"
"4/2/17","5/6/17","SKIP THIS LINE" "4/2/17","5/6/17","SKIP THIS LINE"
"4/2/17","5/6/17","SKIP THIS LINE" "4/2/17","5/6/17","SKIP THIS LINE"
"Row Count","447","SKIP THIS LINE"
"Country Of Sale","Total","Total Units1","Total Units2","Total C_F","SPCU","PCUT","CPS","USPS","Total Share","EffSUBS","ActSUBS"
"AU","0","139851331","139851331","195833.36","0.001400297","1170","1.36","","1.36","91704.63","99430"
"Cloud Total","1.36" "Sales Total","243.18" "Total Amount","244.54"
$ gawk '
BEGIN { FS=OFS="\t" }
/Country Of Sale/ { f=1 }
/Cloud Total/ { f=0; nextfile }
f { print FILENAME, $0 }
' file
file "Country Of Sale","Total","Total Units1","Total Units2","Total C_F","SPCU","PCUT","CPS","USPS","Total Share","EffSUBS","ActSUBS"
file "AU","0","139851331","139851331","195833.36","0.001400297","1170","1.36","","1.36","91704.63","99430"
If you have multiple input files and only want the "Country Of Sale" line to appear once, then one way is:
$ gawk '
BEGIN { FS=OFS="\t" }
/Country Of Sale/ { f=1; if (NR==FNR) print FILENAME, $0; next}
/Cloud Total/ { f=0; nextfile }
f { print FILENAME, $0 }
' file file file
file "Country Of Sale","Total","Total Units1","Total Units2","Total C_F","SPCU","PCUT","CPS","USPS","Total Share","EffSUBS","ActSUBS"
file "AU","0","139851331","139851331","195833.36","0.001400297","1170","1.36","","1.36","91704.63","99430"
file "AU","0","139851331","139851331","195833.36","0.001400297","1170","1.36","","1.36","91704.63","99430"
file "AU","0","139851331","139851331","195833.36","0.001400297","1170","1.36","","1.36","91704.63","99430"
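The NR==FNR test used above is true only while awk is reading the first file named on the command line: NR counts records across all input, while FNR restarts at 1 for each file. As a minimal sketch (the temporary files first.txt and second.txt and their contents are made up for illustration, and plain awk is used because nothing here is gawk-specific), the following prints both counters for every line:

```shell
# Two made-up input files in a throwaway directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'a1' 'a2' > "$tmpdir/first.txt"
printf '%s\n' 'b1' 'b2' > "$tmpdir/second.txt"

# Print the file name, NR, FNR, and whether NR == FNR holds for each record.
nr_fnr_demo=$(
  cd "$tmpdir" && awk '
    { print FILENAME, NR, FNR, (NR == FNR ? "first file" : "later file") }
  ' first.txt second.txt
)
printf '%s\n' "$nr_fnr_demo"
rm -rf "$tmpdir"
```

Only the two lines from first.txt are labelled "first file". One consequence is that the header is printed only if it occurs in the first file; if that file is empty or lacks the "Country Of Sale" line, no header appears at all.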
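As I noted in a comment, /[^Country Of Sale]/ probably doesn't do what you think it should. Hint: one of the repeated spaces is redundant. (As it happens, space is the only character repeated in that negated character class.)
What it actually does is look for any character other than one of [ COSaeflnortuy] (the square brackets are metacharacters) and, if it finds one, skip to the next line. For example, if the line contains a double quote or a comma, awk skips to the next line of input, because neither double quote nor comma is listed inside the square brackets.
Note that in your CSV data, "Cloud Total" does not start with C; it starts with a double quote. Unfortunately, the regex you use to search for it insists that C must be the first character.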
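Both points can be checked on a few sample lines. This is only a sketch: the input lines and the variable names bracket_demo and anchor_demo are made up for illustration, and plain awk is used since the behaviour is not specific to gawk.

```shell
# Lines built only from characters in the set are kept; a line containing
# a double quote or a comma matches the negated bracket expression.
bracket_demo=$(
  printf '%s\n' 'Country Of Sale' 'Sale Of Country' '"Country Of Sale","Total"' |
  awk '
    # Matches when the line has at least one character outside the set.
    /[^Country Of Sale]/ { print "skipped: " $0; next }
    { print "kept: " $0 }
  '
)
printf '%s\n' "$bracket_demo"

# The ^ anchor requires C to be the very first character, but the CSV
# field starts with a double quote, so only the second pattern matches.
anchor_demo=$(
  printf '%s\n' '"Cloud Total","1.36"' |
  awk '
    /^Cloud Total/   { print "unquoted pattern matched" }
    /^"Cloud Total"/ { print "quoted pattern matched" }
  '
)
printf '%s\n' "$anchor_demo"
```

The first two sample lines are kept even though the second is not the string "Country Of Sale", which shows that the bracket expression tests individual characters rather than the phrase.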
I think you need something like this:
gawk 'FNR==1,/Country Of Sale/ { next }
/Cloud Total/ { nextfile }
{ print }' data
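That lists just the AU line from the given data (and if you list the same file 3 times on one command line, you get 3 lines starting with AU, so it works across files, in part because of the range FNR==1,/…/).
You should be able to take it from there. You can make the patterns more restrictive if you like (/^"Country Of Sale",/ etc.). You can use { print FILENAME OFS $0 } to print each line prefixed with the file name and the output field separator (a tab, as set on the command line).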
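As a small sketch of that prefixing step (the file sample.txt and its single line are made up for illustration), OFS can be assigned on the command line exactly as in the question:

```shell
# One made-up data line in a throwaway directory.
tmpdir=$(mktemp -d)
printf '%s\n' '"AU","0"' > "$tmpdir/sample.txt"

# OFS='\t' is a command-line assignment; awk applies it when it reaches
# that argument, so it must come before the file names it should affect.
prefix_demo=$(cd "$tmpdir" && awk '{ print FILENAME OFS $0 }' OFS='\t' sample.txt)
printf '%s\n' "$prefix_demo"
rm -rf "$tmpdir"
```

The output is the file name, a tab, and the original line. Because the assignment is processed only when awk gets to it in the argument list, a BEGIN block would not see it; setting OFS inside BEGIN is the alternative if it is needed that early.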
This, and @Ed's suggestion too, both give all of the lines of data, instead of just what's between "Country Of Sale" and "Cloud Total".
Here's what I get (on a Mac running macOS Sierra 10.12.6, using a home-built GNU Awk 4.1.3, API: 1.1):
$ cat data
"Start Date","End Date","UPC" "4/2/17","5/6/17","SKIP THIS LINE"
"4/2/17","5/6/17","SKIP THIS LINE" "4/2/17","5/6/17","SKIP THIS LINE"
"4/2/17","5/6/17","SKIP THIS LINE" "4/2/17","5/6/17","SKIP THIS LINE"
"Row Count","447","SKIP THIS LINE"
"Country Of Sale","Total","Total Units1","Total Units2","Total C_F","SPCU","PCUT","CPS","USPS","Total Share","EffSUBS","ActSUBS"
"AU","0","139851331","139851331","195833.36","0.001400297","1170","1.36","","1.36","91704.63","99430"
"Cloud Total","1.36" "Sales Total","243.18" "Total Amount","244.54"
$ gawk 'FNR==1,/Country Of Sale/{next} /Cloud Total/ {nextfile} { print }' data data data
"AU","0","139851331","139851331","195833.36","0.001400297","1170","1.36","","1.36","91704.63","99430"
"AU","0","139851331","139851331","195833.36","0.001400297","1170","1.36","","1.36","91704.63","99430"
"AU","0","139851331","139851331","195833.36","0.001400297","1170","1.36","","1.36","91704.63","99430"
$
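Given that I gave it the file to process 3 times, that is what I would expect, and it seems to be what you want too.
If you want the "Country Of Sale" heading line in the output, it is easy to add:
gawk 'FNR==1,/Country Of Sale/ { if ($0 ~ /Country Of Sale/) print; next }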
/Cloud Total/ { nextfile }
{ print }' data
If you want the header only once, even though it appears in many files, then:
gawk 'FNR==1,/Country Of Sale/ { if ($0 ~ /Country Of Sale/ && hdr_count++ == 0) print; next }
/Cloud Total/ { nextfile }
{ print }' data
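To try the header-once version on files whose header sits on different line numbers, here is a self-contained sketch. The two files a.txt and b.txt and their contents are made up for illustration, and plain awk is used on the assumption that it supports nextfile, as gawk does.

```shell
tmpdir=$(mktemp -d)
# Made-up reports: the header is on line 2 of a.txt and line 3 of b.txt.
printf '%s\n' '"Row Count","1"' '"Country Of Sale","Total"' '"AU","0"' \
  '"Cloud Total","1.36"' '"Sales Total","243.18"' > "$tmpdir/a.txt"
printf '%s\n' '"Start Date","End Date"' '"Row Count","1"' \
  '"Country Of Sale","Total"' '"NZ","5"' '"Cloud Total","2.00"' > "$tmpdir/b.txt"

header_once=$(
  cd "$tmpdir" && awk '
    # Skip from the first line of each file through the header line,
    # printing the header only the first time it is seen.
    FNR == 1, /Country Of Sale/ {
      if ($0 ~ /Country Of Sale/ && hdr_count++ == 0) print
      next
    }
    # Everything from the totals onward is dropped.
    /Cloud Total/ { nextfile }
    { print }
  ' a.txt b.txt
)
printf '%s\n' "$header_once"
rm -rf "$tmpdir"
```

This prints the header once, followed by the AU line from a.txt and the NZ line from b.txt. Neither NR nor FNR is compared against a fixed header position, so it does not matter which line the header is on in each file.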
Thanks to @EdMorton and @JonathanLeffler for giving me the clues I needed. What finally worked was using /^Country Of Sale/{next} & /^Cloud Total/ {nextfile}. Next, I'll go figure out *why* that works!