Sanitize the different formats of CSV-formatted phone numbers

Suppose phone numbers appear in several different formats across CSV files.

Here is the first CSV file, with rows like this:

"Name","Address","FullPhone"
"Mike Wise","101 Abc Drive","4061234567" // Need to separate area code from the rest

Here is another CSV file, containing the following rows:

"Name","Address","Areacode","Phone"
"Mike Wise","101 Abc Drive","406","123-4567" // Need to remove the dash in the seven-digit phone number

Is there a sed one-liner that would convert both into the following common format?

"Name","Address","NPA","TELNO"
"Mike Wise","101 Abc Drive","406","1234567"

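I would prefer a sed one-liner, but if it has to be more than one line, so be it. Also, sed is not required; I just thought sed might be simpler, although I have not yet figured out a sed solution.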

$ cat tst.awk
BEGIN { FS=OFS="\",\"" }
{
    if (NR==1) {
        $3 = "NPA"
        $4 = "TELNO\""
    }
    else {
        gsub(/-/,"",$NF)
        if (NF==3) {
            sub(/.{3}/,"&"OFS,$NF)
        }
    }
    print
}

$ cat file1
"Name","Address","FullPhone"
"Mike Wise","101 Abc Drive","4061234567"

$ awk -f tst.awk file1
"Name","Address","NPA","TELNO"
"Mike Wise","101 Abc Drive","406","1234567"

$ cat file2
"Name","Address","Areacode","Phone"
"Mike Wise","101 Abc Drive","406","123-4567"

$ awk -f tst.awk file2
"Name","Address","NPA","TELNO"
"Mike Wise","101 Abc Drive","406","1234567"

There are also some specific inputs you did not ask about but that might occur; if they do, they are handled correctly anyway:

$ cat file3
"Name","Address","FullPhone"
"Mike Wise","101 Abc Drive","406-1234-567"

$ awk -f tst.awk file3
"Name","Address","NPA","TELNO"
"Mike Wise","101 Abc Drive","406","1234567"

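If you need to remove white space from the input phone numbers as well, rather than just the -s, then just change gsub(/-/,"",$NF) to gsub(/[-[:space:]]/,"",$NF) or gsub(/[^0-9"]/,"",$NF) or similar. Keep the " inside the bracket expression: with this FS the last field still ends with the closing quote, and that quote must survive the gsub.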
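As a rough sketch of that variation, the function below inlines the same logic as tst.awk but strips every character that is neither a digit nor a double quote from the last field. The function name clean_phone_csv and the sample row with parentheses and a space are made up for illustration; the area-code pattern is spelled out as three bracket expressions so that awks without interval-expression support should also accept it.

```shell
# Sketch only: clean_phone_csv is a hypothetical wrapper name, not part of the answer.
clean_phone_csv() {
    awk '
    BEGIN { FS = OFS = "\",\"" }
    NR == 1 { $3 = "NPA"; $4 = "TELNO\""; print; next }
    {
        # Keep digits and the closing quote, drop everything else.
        gsub(/[^0-9"]/, "", $NF)
        # A single phone field: split the 3-digit area code from the rest.
        if (NF == 3) sub(/^[0-9][0-9][0-9]/, "&" OFS, $NF)
        print
    }' "$@"
}

# Hypothetical input row with parentheses, a space, and a dash.
printf '%s\n' \
    '"Name","Address","FullPhone"' \
    '"Mike Wise","101 Abc Drive","(406) 123-4567"' |
clean_phone_csv
```

For the sample row above this should print the common header followed by "Mike Wise","101 Abc Drive","406","1234567".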

sed '1 c\
"Name","Address","NPA","TELNO"
     s/"\([0-9]\{3\}\)\([0-9]\{7\}\)"[[:space:]]*$/"\1","\2"/
     s/-\([0-9]\{1,6\}\)"[[:space:]]*$/\1"/
     ' YourFile

This works for your CSV file formats (and, following @EdMorton's comment, for the header as well).

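  • 1 c\: change the first line to the line that follows (forcibly replacing the original header with that line)
  • the first s/// changes any line that ends with 3 digits followed by 7 digits (so 10 digits in one block) enclosed in double quotes into two field values of 3 and 7 digits respectively, using the grouping feature of s///
  • the second s/// changes a trailing - followed by 1 to 6 digits and a double quote into the same text without the -, using the grouping feature (back-reference \1).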

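The first s/// does not touch the lines of the second sample (nothing matches the pattern), the second does not touch the lines of the first sample (same reason), and it also does not touch a line already changed by the first s/// (same reason again).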
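One way to check that claim is to run each substitution on its own against one data row of each format. This is only a sketch: the variable names split_ten, drop_dash, row1 and row2 are made up for illustration, and the replacement parts assume the usual \1 and \2 back-references.

```shell
# The two substitutions, stored separately so each can be tried on its own.
split_ten='s/"\([0-9]\{3\}\)\([0-9]\{7\}\)"[[:space:]]*$/"\1","\2"/'
drop_dash='s/-\([0-9]\{1,6\}\)"[[:space:]]*$/\1"/'

row1='"Mike Wise","101 Abc Drive","4061234567"'      # first file format
row2='"Mike Wise","101 Abc Drive","406","123-4567"'  # second file format

# First substitution alone: row1 is split, row2 passes through unchanged.
printf '%s\n' "$row1" "$row2" | sed "$split_ten"

# Second substitution alone: row1 passes through unchanged, row2 loses its dash.
printf '%s\n' "$row1" "$row2" | sed "$drop_dash"
```

Each command prints two lines, and in each pair only the row that the substitution targets comes out rewritten as "Mike Wise","101 Abc Drive","406","1234567".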