Sanitize the different formats of CSV-formatted phone numbers
Suppose there are several CSV files that store phone numbers in different formats.
Here is the first CSV file, with lines like these:
"Name","Address","FullPhone"
"Mike Wise","101 Abc Drive","4061234567" // Need to separate area code from the rest
Here is another CSV file, with lines like these:
"Name","Address","Areacode","Phone"
"Mike Wise","101 Abc Drive","406","123-4567" // Need to remove the dash in the seven-digit phone number
Is there some sed one-liner that converts both into the following common format?
"Name","Address","NPA","TELNO"
"Mike Wise","101 Abc Drive","406","1234567"
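I would prefer a sed one-liner, but if it has to be more than one line, so be it. It also does not have to be sed; I just assumed sed would be the simplest tool, although I have not come up with a sed solution yet.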
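$ cat tst.awk
BEGIN { FS=OFS="\",\"" }
{
    if (NR==1) {
        # header: overwrite (or create) the 3rd and 4th column names
        $3 = "NPA"
        $4 = "TELNO\""
    }
    else {
        # remove dashes from the last field
        gsub(/-/,"",$NF)
        if (NF==3) {
            # single 10-digit field: split after the 3-digit area code
            sub(/.{3}/,"&"OFS,$NF)
        }
    }
    print
}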
$ cat file1
"Name","Address","FullPhone"
"Mike Wise","101 Abc Drive","4061234567"
$ awk -f tst.awk file1
"Name","Address","NPA","TELNO"
"Mike Wise","101 Abc Drive","406","1234567"
$ cat file2
"Name","Address","Areacode","Phone"
"Mike Wise","101 Abc Drive","406","123-4567"
$ awk -f tst.awk file2
"Name","Address","NPA","TELNO"
"Mike Wise","101 Abc Drive","406","1234567"
Here is some input you did not ask about but that might occur; if it does, it is handled correctly anyway:
$ cat file3
"Name","Address","FullPhone"
"Mike Wise","101 Abc Drive","406-1234-567"
$ awk -f tst.awk file3
"Name","Address","NPA","TELNO"
"Mike Wise","101 Abc Drive","406","1234567"
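If you need to strip spaces from the input phone numbers, not just dashes, then just change gsub(/-/,"",$NF) to gsub(/[-[:space:]]/,"",$NF) or gsub(/[^0-9]/,"",$NF) or similar. Note that with this FS the closing double quote is part of $NF, so the [^0-9] variant removes it as well and it has to be appended again before printing.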
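As a minimal sketch of the "strip everything that is not a digit" variant, the script above could be wrapped as follows. The function name normalize_phone_csv is hypothetical, and the area code is split off with substr() rather than the .{3} interval expression, purely so that the sketch does not depend on interval support in the awk being used. Because the closing quote belongs to $NF with this FS, it is re-added explicitly.

```shell
# Hypothetical helper (name is an assumption, not from the original answer).
# Reads CSV from the named files or from stdin.
normalize_phone_csv() {
    awk '
    BEGIN { FS = OFS = "\",\"" }
    NR == 1 {                     # header: force the common column names
        $3 = "NPA"
        $4 = "TELNO\""
        print
        next
    }
    {
        gsub(/[^0-9]/, "", $NF)   # drops dashes, blanks and the closing quote
        if (NF == 3) {            # single phone field: split off the area code
            $NF = substr($NF, 1, 3) OFS substr($NF, 4)
        }
        $NF = $NF "\""            # put the closing quote back
        print
    }' "$@"
}
```

It would be called as normalize_phone_csv file1, or at the end of a pipeline. Only the last field is cleaned; a separate area code column is left untouched.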
sed '1 c\
"Name","Address","Areacode","Phone"
s/"\([0-9]\{3\}\)\([0-9]\{7\}\)"[[:space:]]*$/"\1","\2"/
s/-\([0-9]\{1,6\}\)"[[:space:]]*$/\1"/
' YourFile
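This works for your CSV file formats (and, after @EdMorton's comment, for the header as well).
- 1 c\ replaces the first line with the line that follows it, so the original header is overwritten unconditionally (put "NPA","TELNO" there to get the header names requested in the question).
- The first s/// changes any line that ends with 3 digits immediately followed by 7 digits (10 digits in one block) enclosed in double quotes into two quoted fields of 3 and 7 digits, using the groups \1 and \2.
- The second s/// changes a trailing - followed by 1 to 6 digits and a double quote into the same text without the -, again using a group (back-reference \1).
The first s/// does not touch the lines of the second sample (the pattern does not match), the second s/// does not touch the lines of the first sample (same reason), and it also leaves alone any line already changed by the first s/// (same reason again).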
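The two substitutions described above can be tried on both sample layouts with a small wrapper. This is only a sketch: the function name sed_normalize is hypothetical, the header text uses the NPA/TELNO names from the question, and the header is rewritten with 1s/.*/.../ instead of 1 c\ so that the whole command fits on one logical line.

```shell
# Hypothetical wrapper (name is an assumption for illustration).
# Reads CSV from the named files or from stdin.
sed_normalize() {
    # 1st -e: overwrite the header line with the common column names
    # 2nd -e: split a quoted 10-digit field into quoted 3- and 7-digit fields
    # 3rd -e: drop the dash in front of the last 1 to 6 digits of the line
    sed -e '1s/.*/"Name","Address","NPA","TELNO"/' \
        -e 's/"\([0-9]\{3\}\)\([0-9]\{7\}\)"[[:space:]]*$/"\1","\2"/' \
        -e 's/-\([0-9]\{1,6\}\)"[[:space:]]*$/\1"/' "$@"
}
```

Unlike the awk script, this only covers the two layouts from the question. A value such as "406-1234-567" would lose only its last dash and would not be split, because the first substitution needs ten contiguous digits.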