Uniqing a delimited file based on a subset of fields

Question

I have data like the following:

1493992429103289,207.55,207.5
1493992429103559,207.55,207.5
1493992429104353,207.55,207.5
1493992429104491,207.6,207.55
1493992429110551,207.55,207.5

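Because of the nature of the last two columns, their values change throughout the day and repeat regularly. Grouping the data as in my desired output (below) lets me see each time their values change (the first column holds the epoch time). Is there a way to produce the desired output shown here: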

1493992429103289,207.55,207.5
1493992429104491,207.6,207.55
1493992429110551,207.55,207.5

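So I am consolidating the data by the last two columns. However, the consolidated rows are not globally unique (as you can see, the pair 207.55, 207.5 appears again).

I have tried: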

uniq -f 1

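But the output contains only the first line and does not continue through the rest of the list.

The awk solution below never prints a pair that has already occurred, so it gives this output (shown below the awk code):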

awk -F, '!x[$2,$3]++' file

1493992429103289,207.55,207.5
1493992429104491,207.6,207.55

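I do not want to sort the data by the last two columns. However, since the first column is epoch time, sorting by the first column would be acceptable.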

Answer 1

You can use an Awk statement like the following:

awk 'BEGIN{FS=OFS=","} s != $2 || t != $3 {print} {s=$2;t=$3}' file

which produces the required output:

1493992429103289,207.55,207.5
1493992429104491,207.6,207.55
1493992429110551,207.55,207.5

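The idea is to store the values of the second and third columns in the variables s and t, and to print a line only when its values differ from those of the previous line.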
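
As a minimal sketch of the same idea, the previous pair can be kept as one combined key, so that a change in either column triggers a print. The names `key`, `prev`, `data`, and `result` are chosen for this illustration and are not part of the answer above. The key is compared as a string, so `207.5` and `207.50` would count as different values.

```shell
# Sample data from the question, written to a temporary file.
data=$(mktemp)
cat > "$data" <<'EOF'
1493992429103289,207.55,207.5
1493992429103559,207.55,207.5
1493992429104353,207.55,207.5
1493992429104491,207.6,207.55
1493992429110551,207.55,207.5
EOF

# Build a key from columns 2 and 3, and print a line only when the key
# differs from the key of the previous line. The first line always prints.
result=$(awk -F, '{ key = $2 FS $3 }
                  NR == 1 || key != prev { print }
                  { prev = key }' "$data")

printf '%s\n' "$result"
rm -f "$data"
```

Because only the previous key is remembered, a pair that reappears later in the file is printed again, which is what the question asks for.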

Answer 2

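I found an answer that is not as elegant as Inian's but serves my purpose. Since my first column is always a time in microseconds and never gains or loses characters, I can use the following uniq command: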

uniq -s 17
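
The skip count only works if every timestamp has the same width. The following is a hedged sketch, not part of the original answer, that checks this assumption before running uniq: it measures the length of the first field plus the comma on every line and refuses to continue if the width varies. The variable names `data`, `width`, and `result` are illustrative.

```shell
# Sample data from the question, written to a temporary file.
data=$(mktemp)
cat > "$data" <<'EOF'
1493992429103289,207.55,207.5
1493992429103559,207.55,207.5
1493992429104353,207.55,207.5
1493992429104491,207.6,207.55
1493992429110551,207.55,207.5
EOF

# Width of "first field + comma" on each line; awk exits non-zero
# if any line disagrees with the first one.
if width=$(awk -F, '{ w = length($1) + 1 }
                    NR == 1 { first = w }
                    w != first { bad = 1 }
                    END { if (bad) exit 1; print first }' "$data"); then
  # Skip exactly that many characters when comparing adjacent lines.
  result=$(uniq -s "$width" "$data")
  printf '%s\n' "$result"
else
  echo "first column width varies; uniq -s is not safe here" >&2
fi
rm -f "$data"
```

For the sample data the measured width is 17 (16 digits plus the comma), which matches the value passed to `-s` above.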

Answer 3

You can try comparing each line with the previous one manually (using a loop).

prev_line=""
# start at the first line
i=1

# strip the first column, which does not need to be compared
sed 's@^[0-9][0-9]*,@@' ./data_file > ./transform_data_file

# make sure the list of line numbers exists even if nothing is duplicated
: > ./line_to_be_suppress

# for every line of the file without its first column
while IFS= read -r current_line
do
  # if the previous line is the same as the current line
  if [ "$prev_line" = "$current_line" ]
  then
    # record the line number so it can be deleted afterwards
    echo "$i" >> ./line_to_be_suppress
  fi

  # remember the current line as the previous line
  prev_line=$current_line

  # increment the current line number
  i=$(( i + 1 ))
done < ./transform_data_file

# delete the recorded lines, last first, so earlier line numbers stay valid
for line_to_suppress in $(tac ./line_to_be_suppress) ; do sed -i "${line_to_suppress}d" ./data_file ; done

rm ./line_to_be_suppress
rm ./transform_data_file

Answer 4

You cannot set the field delimiter for uniq; it has to be whitespace. With the help of tr, you can do:

tr ',' ' ' <file | uniq -f1 | tr ' ' ','

1493992429103289,207.55,207.5
1493992429104491,207.6,207.55
1493992429110551,207.55,207.5
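
To illustrate why the delimiter matters, here is a small sketch, not taken from the answer itself, that runs `uniq -f 1` on the comma-separated data and on the space-separated version. With commas, each line is a single blank-delimited field, so skipping one field leaves nothing to compare and every line looks like a duplicate of the first. The variable names are illustrative.

```shell
# Sample data from the question, written to a temporary file.
data=$(mktemp)
cat > "$data" <<'EOF'
1493992429103289,207.55,207.5
1493992429103559,207.55,207.5
1493992429104353,207.55,207.5
1493992429104491,207.6,207.55
1493992429110551,207.55,207.5
EOF

# Commas are not field separators for uniq: the whole line is field 1,
# so after skipping it every line compares equal.
comma_result=$(uniq -f 1 "$data")

# After turning commas into spaces, field 1 is just the timestamp,
# and the comparison covers the remaining two columns.
space_result=$(tr ',' ' ' < "$data" | uniq -f 1 | tr ' ' ',')

printf 'with commas:\n%s\n' "$comma_result"
printf 'with spaces:\n%s\n' "$space_result"
rm -f "$data"
```

Note that the round trip through tr assumes the data contains no spaces of its own; any existing space would come back as a comma.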

Answer 5

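Since your first field appears to have a fixed length of 17 characters (16 digits plus the , separator), you can use uniq's -s option, which should be the best choice for larger files:

uniq -s 17 file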

This gives the following output:

1493992429103289,207.55,207.5
1493992429104491,207.6,207.55
1493992429110551,207.55,207.5

From man uniq:

-f num

Ignore the first num fields in each input line when doing comparisons. A field is a string of non-blank characters separated from adjacent fields by blanks. Field numbers are one based, i.e., the first field is field one.

-s chars

Ignore the first chars characters in each input line when doing comparisons. If specified in conjunction with the -f option, the first chars characters after the first num fields will be ignored. Character numbers are one based, i.e., the first character is character one.

