Modify multiple column values based on specific column values in Linux

I have a file with the following data:

"col1","col2","col3","col4","col5","col6"
"CACR","0","SO2","50","6","2.0"
"FF","15","CO2","20","4","3"
"CACR","25","NOx","30","10",        
"CACR","50","CO","40","5","0"

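I want to find every row that contains CACR and then, using the Linux terminal, divide col2, col4, col5 and col6 by the col6 value when that value is neither zero nor empty. So my output should look like this: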

"col1","col2","col3","col4","col5","col6"
"CACR","0","SO2","25","3","1"
"CACR","25","NOX","30","10",           
"CACR","50","CO","40","5","0"

I am trying to use grep and awk:

grep CACR file.csv | awk -F "," '$6 != 0; $6 == "" { $2 = $2/$6; $4 = $4/$6; $5 = $5/$6; $6 = $6/$6 }1'

but I cannot get the desired output.

As noted in a comment, the primary problem is that the double quotes around the fields mean that when a field is interpreted as a number (e.g. in a division), its value is zero. I think you need to write Awk functions to remove and reinstate the double quotes. With those in place, it's mostly a SMOP (a Simple Matter of Programming).
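To make the point about quoted fields concrete, here is a minimal sketch. The function names divide_quoted and divide_unquoted are made up for this illustration and are not part of the answer's scripts. The first divides a field with its quotes still attached; the second strips the quotes first.

```shell
#!/bin/sh
# Hypothetical helper names, used only for this illustration.

# Divide the first field by 2 while it still carries its double quotes.
# The field starts with a quote character, so awk converts it to 0.
divide_quoted() {
    awk -F, '{ print $1 / 2 }'
}

# Copy the field, strip the double quotes, then divide.
divide_unquoted() {
    awk -F, '{ v = $1; gsub(/"/, "", v); print v / 2 }'
}

printf '%s\n' '"50","6"' | divide_quoted      # prints 0
printf '%s\n' '"50","6"' | divide_unquoted    # prints 25
```

The quoted version silently yields 0 rather than an error, which is why the attempt in the question produced no useful output.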

Here's my version. It could be written more tersely (fewer newlines, fewer spaces), but I prefer clarity to concision.

script.awk

function strip_quotes(s)
{
    gsub(/"/, "", s)
    return s
}
function add_quotes(s)
{
    return sprintf("\"%s\"", s)
}
BEGIN        { FS = "," }
NR == 1      { print; next }
$0 !~ /CACR/ { next }
$6 == "" || $6 == "\"0\"" { print; next }
        {
            div = strip_quotes($6)
            printf("%s,%s,%s,%s,%s,%s\n",
                   $1,
                   add_quotes(strip_quotes($2) / div),
                   $3,
                   add_quotes(strip_quotes($4) / div),
                   add_quotes(strip_quotes($5) / div),
                   add_quotes(strip_quotes($6) / div))
        }

data

"col1","col2","col3","col4","col5","col6"
"CACR","0","SO2","50","6","2.0"
"FF","15","CO2","20","4","3"
"CACR","25","NOx","30","10",
"CACR","50","CO","40","5","0"

Output

$ awk -f script.awk data
"col1","col2","col3","col4","col5","col6"
"CACR","0","SO2","25","3","1"
"CACR","25","NOx","30","10",
"CACR","50","CO","40","5","0"
$

Variation script3.awk

This code alternatively sets the output field separator OFS to a comma as well, and resets the values of $2, $4, $5 and $6 before printing the modified $0 using print.
function strip_quotes(s)
{
    gsub(/"/, "", s)
    return s
}
function add_quotes(s)
{
    return sprintf("\"%s\"", s)
}
BEGIN        { FS = ","; OFS = "," }
NR == 1      { print; next }
$0 !~ /CACR/ { next }
$6 == "" || $6 == "\"0\"" { print; next }
        {
            div = strip_quotes($6)
            $2 = add_quotes(strip_quotes($2) / div)
            $4 = add_quotes(strip_quotes($4) / div)
            $5 = add_quotes(strip_quotes($5) / div)
            $6 = add_quotes(strip_quotes($6) / div)
            print
        }

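Data validation

Both versions of the script could be more rigorous, validating that there are 5 or 6 columns (rejecting rows with more or fewer columns, or complaining about them). The check on the heading could insist on 6 columns. It might be sensible to check that div is a non-zero number. It might be sensible to check that each of $2, $4, $5 and $6 is a number. The divisors (column 6) in the sample data are convenient; if the numbers were not so simple (7, for example), you might need to do some work, because the results could have many decimal places. You would need to decide how those numbers should be formatted (the default might or might not be correct). It might also be worth checking that the data in each field matches the regex /^"[^"]*"$/ (so each value is surrounded by double quotes).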
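The checks suggested above could be sketched roughly as follows. This is an illustration under assumptions, not part of the original scripts: validate_rows and complain are made-up names, the number pattern accepts only plain decimals (no exponents), and a blank final field is tolerated as "no divisor".

```shell
#!/bin/sh
# validate_rows is a hypothetical name; usage: validate_rows data
# Diagnostics go to stderr; the exit status is 1 if any row was rejected.
validate_rows() {
    awk -F, '
    # Report a problem on stderr and remember that something failed.
    function complain(msg) {
        printf("line %d: %s\n", NR, msg) > "/dev/stderr"
        bad = 1
    }
    # The heading must have exactly 6 columns.
    NR == 1 { if (NF != 6) complain("header must have 6 columns"); next }
    # Data rows must have 5 or 6 columns.
    NF < 5 || NF > 6 { complain("expected 5 or 6 columns, got " NF); next }
    {
        for (i = 1; i <= NF; i++) {
            # Assumption: a blank sixth field is allowed (no divisor).
            if (i == 6 && $i ~ /^[ \t]*$/) continue
            if ($i !~ /^"[^"]*"$/) {
                complain("field " i " is not double-quoted")
                continue
            }
            # Columns 2, 4, 5 and 6 must hold a plain decimal number.
            if ((i == 2 || i >= 4) && $i !~ /^"-?[0-9]+(\.[0-9]+)?"$/)
                complain("field " i " is not a number")
        }
    }
    END { exit (bad ? 1 : 0) }
    ' "$@"
}
```

Running this as a separate pass before script.awk keeps the transformation script short; whether to reject bad rows or merely report them is a design choice that depends on the data.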

Trailing white space

The rule $6 == "" || $6 == "\"0\"" { print; next } does not handle trailing white space well. It could be revised to:

$6 ~ /^[[:space:]]*$/ || $6 == "\"0\"" { print; next }

to recognize trailing white space and treat it as zero. It would also be feasible, and probably sensible, to add:

if (div == 0) { print; next }

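after the assignment to div. If the value found there is zero, the data has problems. The script could also complain, generating an error message that diagnoses 'malformed data'.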
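Putting the white-space-tolerant rule and the zero-divisor guard together, a variant of script3.awk might look like the sketch below. The wrapper name safe_divide and the wording of the diagnostic are assumptions made for illustration. The sketch tests div + 0 == 0 rather than div == 0: div is a string after gsub, so a bare comparison with 0 would be a string comparison and would miss values such as 0.0.

```shell
#!/bin/sh
# safe_divide is a hypothetical wrapper name; usage: safe_divide data
safe_divide() {
    awk '
    function strip_quotes(s) { gsub(/"/, "", s); return s }
    function add_quotes(s)   { return sprintf("\"%s\"", s) }
    BEGIN        { FS = ","; OFS = "," }
    NR == 1      { print; next }
    $0 !~ /CACR/ { next }
    # Blank (or white space only) or quoted-zero divisor: pass through.
    $6 ~ /^[[:space:]]*$/ || $6 == "\"0\"" { print; next }
    {
        div = strip_quotes($6)
        # div is a string here, so add 0 to force a numeric comparison.
        if (div + 0 == 0) {
            printf("line %d: malformed data: %s\n", NR, $0) > "/dev/stderr"
            print
            next
        }
        $2 = add_quotes(strip_quotes($2) / div)
        $4 = add_quotes(strip_quotes($4) / div)
        $5 = add_quotes(strip_quotes($5) / div)
        $6 = add_quotes(strip_quotes($6) / div)
        print
    }
    ' "$@"
}
```

Rows with a suspicious divisor are still printed unchanged, so the output stays complete while the diagnostic on stderr flags the line for inspection.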

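How much validation and error prevention is worthwhile depends on how unruly your input data is. If you're dealing with human-generated data, you have to cope with the human tendency to bend the rules and feed programs erratic or erroneous data, and you'll probably need to handle (diagnose) unexpected input. If you're dealing with machine-generated data, it is usually more uniform, and you can reduce the validation effort.

Most solutions that rely on regular expressions have to strike a balance between working well enough and breaking on erratic input. The more erratic the input, the harder it is to design bomb-proof (fool-proof) regular expressions. As the saying goes, "If you make something idiot-proof, someone will just make a better idiot".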