Extract several space-delimited fields from file with varying delimiters into another file in Bash

Question

I have a Unicode/UTF-8 text file from some third-party Windows software, containing about ten columns of data.

The header line is tab-delimited. However, the remaining lines are space-delimited (not tab-delimited!), as seen when opening the file in Notepad++ or TextWrangler.

Here are the first four lines of the file (as an example):

x       y   z(ns)       z(cm)   z-abs(cm)   longitude-  E   latitude-   N   type_of_object  description
 728243.03     5993753.83    0             0             0             143.537779835969           -36.1741232463362           linestart     DRIVEWAYGRAVEL
 728242.07     5993756.02    0             0             0             143.537768534943           -36.1741037476109           line          DRIVEWAYGRAVEL
 728242.26     5993756.11    0             0             0             143.537770619485           -36.1741028922293           linestart     DRIVEWAYGRAVEL

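(n.b. the space at the start of every line except the header line)

I am trying to write a Bash script to reformat the data for import into a different Windows program.

(I realize I could do this on the Windows command line, but I have no experience with it, so I would rather copy the file to my Debian machine and do it in Bash. This means the input and output files need to be Windows-compatible, but the script itself obviously runs on Linux.)

I need to do the following:

1. Extract the first two columns (the x and y coordinates), but only for lines that contain "rectangle" in the second-last column, using a comma as the delimiter.
2. Append a 1 or a 0 to the end of each line. The first line should have a 1, lines 2-4 a 0, line 5 a 1, lines 6-8 a 0, and so on. That is, every fourth line (starting from the first) should have a 1, and all the other lines a 0.

So the output file should look like this: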

728257.89,5993759.24,1
728254.83,5993758.54,0
728251.82,5993762.4,0
728242.45,5993765.07,0

I have tried the answer to this question, e.g.:

awk '
NR==1{
    for(i=1;i<=NF;i++)
        if($i!="z(ns)")
            cols[i]
}
{
    for(i=1;i<=NF;i++)
        if(i in cols)
            printf "%s ",$i
    printf "\n"
}' input.file > output.file

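...to remove the third column (with variations of this to remove the other unwanted columns). However, I was just left with an empty output file.

I have also tried hacking together a solution with grep and awk: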

touch output.txt
count=0
IFS=$'\n'
set -f #disable globbing
for i in $( grep "rectangle" $inputFile )
do
    Xcoord=$(awk 'BEGIN { FS=" " } { print $1 }' $i )
    printf "$Xcoord" >> output.txt
    echo ","
    Ycoord=$(awk 'BEGIN { FS=" " } { print $2 }' $i )
    printf "$Ycoord" >> output.txt
    printf ","
    count=$((count+1))
    if [[ count = "1" ]]
    then
        printf "$count\n" >> output.txt
    else
        printf "0\n" >> output.txt
    fi
done
set +f #re-enable globbing for future use of the terminal.

...the idea behind this being that, for every line in $inputFile that contains "rectangle":

1. Append the first column (variable "Xcoord") to output.txt
2. Append a comma to output.txt
3. Append the second column (variable "Ycoord") to output.txt
4. Append another comma to output.txt
5. Append the 1 or 0 as per the if test based on the value of the variable "count", along with a new line.

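This idea failed. Instead of saving the data to the file, it printed all the columns of the file to standard output, with the first column replaced by the text "(No such file or directory)", while output.txt was just full of zeros.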

How can I solve this?
Do I need to do anything to make the resulting output.txt file Windows-format?

Thanks in advance...

Answer 1

I think awk can do everything you need in a single line:

 awk -F '[[:space:]][[:space:]]+' 'BEGIN{OFS = ","} {if ($8 == "rectangle") print $1, $2 }' a.txt | awk 'BEGIN{OFS = ","}{if((NR+3)%4) print $0,0;else print $0,1}'

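Set the delimiter between entries to "at least two spaces" with

-F '[[:space:]][[:space:]]+'

set the output delimiter to a comma with

BEGIN{OFS = ","}

check the second-last column for the rectangle condition with

if ($8 == "rectangle")

and print the columns you want as output with

print $1, $2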
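To see how that separator splits a data row, here is a small sketch. The row is hypothetical: it follows the layout in the question, but the question shows no "rectangle" row, so the longitude, latitude, and type values are made up. It assumes an awk that understands POSIX character classes such as [[:space:]]. Each field is printed in brackets so stray whitespace is visible.

```shell
# Hypothetical sample row in the question's layout (note the single leading space).
row=' 728257.89     5993759.24    0             0             0             143.5379           -36.1740           rectangle     DRIVEWAYGRAVEL'

# Split on runs of two or more whitespace characters and list every field.
fields=$(printf '%s\n' "$row" |
    awk -F '[[:space:]][[:space:]]+' '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }')
printf '%s\n' "$fields"
```

With this separator the single space at the start of the row is not treated as a delimiter, so it stays attached to the first field; that is worth keeping in mind if the importing program is strict about leading blanks.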

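To add the 0,1 pattern in the third output column, you have to start awk a second time so that you get the line numbers of the filtered result rather than of the original input. The awk variable NR holds the line number, starting at 1.

(NR+3)%4

(% is the modulo operation) evaluates to 0 (= false) for line numbers 1, 5, 9, ..., so you just print the complete line (variable $0), followed by a 0 in the if branch and a 1 in the else branch.

print $0,0;else print $0,1

I hope this is what you wanted.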
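As a variation, the same result can probably be had from a single awk process by counting matching rows in a variable instead of relying on NR. This is only a sketch under assumptions: it uses awk's default field splitting (which ignores leading blanks and treats any run of blanks or tabs as one separator), so it assumes no column value contains a space. The file name sample.txt, the counter n, and all values other than the x and y coordinates are hypothetical.

```shell
# Hypothetical sample in the question's layout: tab-delimited header, space-padded rows.
{
    printf 'x\ty\tz(ns)\tz(cm)\tz-abs(cm)\tlongitude-E\tlatitude-N\ttype_of_object\tdescription\n'
    printf ' %s     %s    0    0    0    143.5    -36.1    %s     %s\n' \
        728243.03 5993753.83 linestart DRIVEWAYGRAVEL \
        728257.89 5993759.24 rectangle SHED \
        728254.83 5993758.54 rectangle SHED \
        728251.82 5993762.4  rectangle SHED \
        728242.45 5993765.07 rectangle SHED \
        728240.10 5993770.00 rectangle SHED
} > sample.txt

# n counts matching rows only, so the 1/0 pattern is independent of the input line number.
result=$(awk -v OFS=',' '$(NF-1) == "rectangle" { print $1, $2, (n++ % 4 == 0 ? 1 : 0) }' sample.txt)
printf '%s\n' "$result"
```

Testing $(NF-1) rather than a fixed field number keeps the filter tied to "second-last column" even if the column count changes, and the default splitting drops the leading space from the first field.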

Answer 2

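I figured out a solution.

1. Remove the header line.
2. Use grep to filter all lines based on the word "rectangle".
3. Replace spaces with commas for easier processing.
4. Loop through each line, saving to file as required.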

#!/bin/bash
# Code here to retrieve the file from command arguments and set it as $inputFile (removed for brevity)
sed -i 1d "$inputFile" # Remove header line (this edits the input file in place).

sed 's/^ *//' < "$inputFile" > work.txt # Remove the leading spaces from each line.
tr -s ' ' < work.txt | tr ' ' ',' > work2.txt # Squeeze runs of spaces, then switch spaces for commas.
grep "rectangle" work2.txt > work3.txt # Print all lines containing "rectangle" to a new file.
rm -f lineout.txt # Delete output file in case script was run previously; -f avoids an error on the first run.
touch lineout.txt
count=0
while IFS='' read -r line || [[ -n "$line" ]]; do
    printf '%s\n' "$line" > line.txt
    awk 'BEGIN { FS="," } { printf "%s", $1 >> "lineout.txt" }' line.txt
    printf "," >> lineout.txt
    awk 'BEGIN { FS="," } { printf "%s", $2 >> "lineout.txt" }' line.txt
    printf "," >> lineout.txt
    count=$((count + 1))
    if [[ $count = "1" ]]
    then
        printf "$count\n" >> lineout.txt
    else
        printf "0\n" >> lineout.txt
        if [[ $count = "4" ]]
        then
            count=0
        fi
    fi
done < work3.txt
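The question about Windows format is not covered by the script. As a rough sketch, assuming "Windows format" here means CRLF line endings, a final awk pass over the finished file could normalize them. The output name lineout_win.txt is hypothetical, and the first line only creates stand-in data so the sketch runs on its own.

```shell
# Stand-in for the loop's output so this sketch is self-contained.
printf '728257.89,5993759.24,1\n728254.83,5993758.54,0\n' > lineout.txt

# Drop a carriage return if one is already present, then end every line with CR LF.
awk '{ sub(/\r$/, ""); printf "%s\r\n", $0 }' lineout.txt > lineout_win.txt
```

Stripping an existing carriage return first means the pass can be run on a file that is already in CRLF form without doubling the line endings.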

Answer 3

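This can be formatted easily with the Sublime Text editor, using the following features:

Multi-select
Vertical select
Search and replace with expressions similar to bash

I am not trying to advertise Sublime, but this tool really does solve most of my text-editing problems.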
