I have a tab-separated file in Unix that has data issues
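I have to make sure each row has 4 columns, but the input data is pretty messed up:
- The first row is the header.
- The second row is valid, as it has 4 columns.
- The third is also valid (an empty Description field is OK).
Thankfully, the ID field and the last column, Phnumber, are never empty.
As you can see, the 4th row is messed up because of newline characters in the Description column, so it spans multiple lines.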
ID Name Description Phnumber
1051 John 5674 I am doing good, is this task we need to fix 908342
1065 Rohit 9876246
10402 rob I am
doing good,
is this task we need to fix 908341
105552 "Julin rob hain" i know what to do just let me do it
"
"
"
"
"
"
908452
1051 Dave I am doing reporting this week 88889999
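Maybe a screenshot would make the problem easier to see.
Every row starts with a number and ends with a number. Each row should have 4 columns.
Expected output: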
ID Name Description Phnumber
1051 John 5674 I am doing good, is this task we need to fix 908342
1065 Rohit 9876246
10402 rob I am doing good, 563 is this task we need to fix 908341
105552 "Julin rob hain" i know what to do just let me do it 908452
1051 Dave I am doing reporting this week 88889999
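This is sample data; the actual file has 12 columns. Yes, there can be numbers in the columns in between, and a few are date fields (like 2017-03-02).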
awk to the rescue!
Assuming all-numeric fields appear only as the first and last field of each row:
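awk 'NR==1;   # print the header line unchanged
     NR>1 {for(i=1;i<=NF;i++) {   # fields are whitespace-separated words (default FS)
       if($i~/^[0-9]+$/) s=!s     # an all-digit word toggles s: set at the ID, cleared at the last column
       printf "%s", $i (s?OFS:RS)}}' file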
ID Name Description Phnumber
1051 John I am doing good, is this task we need to fix 908342
10423 rob I am doing good, is this task we need to fix 908341
1052 Julin rob hain i know what to do just let me do it " " " " " " 908452
1051 Dave I am doing reporting this week 88889999
Maybe set OFS to \t for more structure.
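The awk answer above relies on all-digit words appearing only at the start and end of a row, while the question says the real file has 12 columns with numbers and dates in between. A possible alternative, sketched here under the assumption that the stray newlines never replace a tab, is to keep gluing physical lines together until the record holds the expected number of tab-separated fields. The function name `fix_rows` and the choice of a single space as the glue are made up for illustration.

```shell
# Hypothetical helper: join physical lines until a record has n tab-separated fields.
# Reads stdin, writes stdout; n defaults to 4.
fix_rows() {
    awk -v n="${1:-4}" '
        {
            sub(/\r$/, "")                        # drop a trailing carriage return, if any
            buf = (buf == "" ? $0 : buf " " $0)   # glue continuation lines with a space
            if (split(buf, parts, "\t") >= n) {   # enough fields: the record is complete
                print buf
                buf = ""
            }
        }
        END { if (buf != "") print buf }          # flush an incomplete trailing record
    '
}
```

For the 12-column file this would be called as `fix_rows 12 < file > fixed.tsv` (the file names are placeholders). Because it counts tabs rather than looking at field contents, numeric or date fields in the middle of a row should not confuse it, but a row that is genuinely missing a column would swallow the next line.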
This worked:
cat file_name | perl -0pe 's/\n(?!([0-9]{6}|$)\t)//g' | perl -0pe 's/\r(?!([0-9]{6}|$)\t)//g' | sed '/^$/d'
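Whichever cleanup is used, it may be worth checking that every row really ends up with the expected number of columns. A minimal sketch, assuming the cleaned file is still tab-separated; `bad_rows` is a hypothetical name:

```shell
# Hypothetical helper: list rows whose tab-separated field count differs from n.
# Reads stdin; n defaults to 4. Prints nothing when every row is well-formed.
bad_rows() {
    awk -F'\t' -v n="${1:-4}" 'NF != n { print NR ": " NF " fields" }'
}
```

Piping the cleaned output through `bad_rows 4` (or `bad_rows 12` for the real file) should print nothing if the repair succeeded; any line it does print gives the row number and the field count found there.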