Sort and extract a certain number of rows from a file containing dates

Question

I have a txt file with dates in the following format:

yyyymmdd

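The raw data looks like this:

There are more than 100k lines. I am trying to save the 10k "newest" lines in one file and the 10k "oldest" lines in a separate file.

I guess this must be a two-step process:

sort the lines,
then extract the top 10k lines, the "newest = most recent dates", and the 10k lines towards the end of the file, the "oldest = most ancient dates".

How can I achieve this using awk?

I even tried with perl, but with no luck, so a perl one-liner would also be very welcome.

Edit: I would prefer a clean, clever solution that I can learn from, rather than an optimization of my attempt.

perl example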

@dates = ('20170401', '20170721', '20200911');
@ordered = sort { &compare } @dates;
sub compare {
    $a =~ /(\d{4})(\d{2})(\d{2})/;
    $c = $1 . $2 . $3;
    $b =~ /(\d{4})(\d{2})(\d{2})/;
    $d = $1 . $2 . $3;
    $c <=> $d;
}
print "@ordered\n";

Answer 1

Here is an answer using perl. If you want the oldest on top, you can use the standard sort order:

@dates = sort @dates;

Reverse sort, newest on top:

@dates = sort { $b <=> $a } @dates;
#                  ^^^
#                   |
# numerical three-way comparison returning -1, 0 or +1

You can then extract 10000 entries from the top:

my $keep = 10000;
my @top = splice @dates, 0, $keep;

And 10000 from the bottom:

$keep = @dates unless(@dates >= $keep);
my @bottom = splice @dates, -$keep;

@dates will now contain the dates that lie between the top 10000 and the bottom 10000 you extracted.

If needed, you can then save the two arrays to files:

sub save {
    my $filename=shift;
    open my $fh, '>', $filename or die "$filename: $!";
    print $fh join("\n", @_) . "\n" if(@_);
    close $fh;
}

save('top', @top);
save('bottom', @bottom);

Answer 2

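Since your date lines sort chronologically when sorted lexicographically, this is simple. Just use sort and then split.

Given:

You can sort that input file and then split it into files of 3 lines each: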

split  -l3 <(sort -r infile) dates
# -l10000 for a 10,000 line split

Result:

for fn in dates*; do echo "${fn}:"; cat "$fn"; done
datesaa:
20211118
20201231
20180903
datesab:
20171115
20171115
20100101

# files are named datesaa, datesab, datesac, ... dateszz
# if you only want two blocks of 10,000 dates, 
# just throw the remaining files away.

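Since you may have many more lines than you are interested in, you can also sort to an intermediate file and then use head and tail to get the newest and the oldest lines respectively: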

sort -r infile >dates_sorted
head -n10000 dates_sorted >newest_dates
tail -n10000 dates_sorted >oldest_dates
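
As a quick sanity check, the head/tail variant can be tried on a small sample first. This is only a sketch: the file names sample_dates.txt, sample_sorted, sample_newest and sample_oldest are made up for illustration, the sample reuses the six dates shown in the result above, and n=2 stands in for 10000.

```shell
# Hypothetical sample file holding six yyyymmdd dates, one per line.
printf '%s\n' 20171115 20211118 20100101 20201231 20180903 20171115 > sample_dates.txt

n=2                                           # use 10000 for the real file
sort -r sample_dates.txt > sample_sorted      # reverse lexicographic order: newest first
head -n "$n" sample_sorted > sample_newest    # first n lines are the newest dates
tail -n "$n" sample_sorted > sample_oldest    # last n lines are the oldest dates

cat sample_newest    # 20211118 and 20201231, one per line
cat sample_oldest    # 20171115 and 20100101, one per line
```

Note that with sort -r both output files are in descending order; pipe the tail output through sort if the oldest dates should come out ascending.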

Answer 3

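Assumptions:

dates are not unique (per OP's comment)
results are dumped to two files, newest and oldest
newest entries will be sorted in descending order
oldest entries will be sorted in ascending order
the host has enough memory to load the entire data file into memory (in the form of an awk array)

Sample input: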

$ cat dates.dat
20170415
20171115
20180903
20131115
20141115
20131115
20141115
20150903
20271115
20271105
20271105
20280903
20071115
20071015
20070903
20031115
20031015
20030903
20011115
20011125
20010903
20010903

One idea using GNU awk:

x=5

awk -v max="${x}" '
    { dates[$1]++ }                              # count occurrences of each date
END { count=0
      PROCINFO["sorted_in"]="@ind_str_desc"      # find the newest "max" dates
      for (i in dates) {
          for (n=1; n<=dates[i]; n++) {
              if (++count > max) break
              print i > "newest"
          }
          if (count > max) break
      }
      count=0
      PROCINFO["sorted_in"]="@ind_str_asc"       # find the oldest "max" dates
      for (i in dates) {
          for (n=1; n<=dates[i]; n++) {
              if (++count > max) break
              print i > "oldest"
          }
          if (count > max) break
      }
    }
' dates.dat

NOTE: if a duplicated date shows up as rows #10,000 and #10,001, the #10,001 entry will not be included in the output

This generates:

$ cat oldest
20010903
20010903
20011115
20011125
20030903

$ cat newest
20280903
20271115
20271105
20271105
20180903

Answer 4

Here is a quick and dirty Awk attempt which collects the ten smallest and the ten largest entries from the file.

awk 'BEGIN { for(i=1; i<=10; i++) max[i] = min[i] = 0 }
NR==1 { max[1] = min[1] = $1; next }
(!max[10]) || ($1 > max[10]) {
    for(i=1; i<=10; ++i) if(!max[i] || (max[i] < $1)) break
    for(j=9; j>=i; --j) max[j+1]=max[j]
    max[i] = $1 }
(!min[10]) || ($1 < min[10]) {
    for(i=1; i<=10; ++i) if (!min[i] || (min[i] > $1)) break
    for(j=9; j>=i; --j) min[j+1]=min[j]
    min[i] = $1 }
END { for(i=1; i<=10; ++i) print max[i];
print "---"
for(i=1; i<=10; ++i) print min[i] }' file

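For simplicity, this makes some naive assumptions (the numbers are all positive, there are at least 20 distinct numbers, and duplicates should be accounted for).

This avoids any external dependencies by using a brute-force sort in native Awk. We keep two sorted arrays, min and max, with 10 items in each, and shift off the values which no longer fit as we populate them with the largest and smallest numbers.

It should be obvious how to extend this to 10,000.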
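
One possible way to do that is to pass the size in with -v instead of hard-coding 10. The following is a sketch of that idea, not the code above with the constant swapped: it is restructured so that it also copes with duplicates and with files shorter than n lines. The variable names (n, nmax, nmin) and the sample_* file names are assumptions made for the example.

```shell
# Hypothetical sample input; replace with the real file.
printf '%s\n' 20170415 20171115 20180903 20131115 20141115 20010903 20280903 > sample.dat

awk -v n=3 '
{
    v = $1 + 0                       # force a numeric comparison

    # Keep the n largest values in max[1..nmax], in descending order.
    if (nmax < n || v > max[nmax]) {
        if (nmax < n) nmax++         # still filling: open a new slot at the end
        for (j = nmax; j > 1 && max[j-1] < v; j--)
            max[j] = max[j-1]        # shift smaller values down by one slot
        max[j] = v
    }

    # Keep the n smallest values in min[1..nmin], in ascending order.
    if (nmin < n || v < min[nmin]) {
        if (nmin < n) nmin++
        for (j = nmin; j > 1 && min[j-1] > v; j--)
            min[j] = min[j-1]        # shift larger values down by one slot
        min[j] = v
    }
}
END {
    for (i = 1; i <= nmax; i++) print max[i] > "sample_newest"
    for (i = 1; i <= nmin; i++) print min[i] > "sample_oldest"
}' sample.dat

cat sample_newest    # 20280903, 20180903, 20171115, one per line
cat sample_oldest    # 20010903, 20131115, 20141115, one per line
```

Each insertion may shift up to n array slots, so with n=10000 and 100k+ lines the sort-based answers are likely to be faster; the appeal here is only that no external sort is needed.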

Answer 5

A command-line script ("one"-liner) with Perl

perl -MPath::Tiny=path -we'
    $f = shift; $n = shift//2;              # filename; number of lines or default
    @d = sort +(path($f)->lines);           # sort lexicographically, ascending
    $n = int @d/2 if 2*$n > @d;             # top/bottom lines, up to half of file
    path("bottom.txt")->spew(@d[0..$n-1]);  # write files, top/bottom $n lines
    path("top.txt")   ->spew(@d[$#d-$n+1..$#d])
' dates.txt 4

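Takes a file name and, optionally, the number of lines to take from the top and from the bottom; in this example 4 is passed (the default is 2), for easy testing with small files. There is no need to check the file name, since the library used to read it, Path::Tiny, does that.
For the library (-MPath::Tiny) I specify the method name (=path) only for documentation; this is not necessary since the library is a class, so =path may simply be dropped.
The sort is alphabetical, but that is fine for dates in this format; the oldest dates come first, but that does not matter since we split off what we need. To enforce numerical sorting, and in descending order in one go, use sort { $b <=> $a } @d;. See sort.
We check that the file has enough lines for the desired number ($n) to be shaved off from both the (sorted) top and the bottom. If it does not, $n is set to half of the file.
The syntax $#ary is the last index of the array @ary, and it is used to count off $n items from the back of the array @d.

This is written as a command-line program ("one-liner") merely because that was asked for. That much code would be far more comfortable in a script.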

Answer 6

Same assumptions as in my [other answer], except that the newest data is in ascending order ...

One idea using sort and head/tail:

$ sort dates.dat | tee >(head -5 > oldest) | tail -5 > newest

$ cat oldest
20010903
20010903
20011115
20011125
20030903

$ cat newest
20180903
20271105
20271105
20271115
20280903

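If needed, OP can add another sort (e.g., tail -5 | sort -r > newest).

For large datasets OP may also want to look into other sort options, e.g., -S (allocate more memory for sorting), --parallel (enable parallel sorting), etc.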
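
A sketch of how those options might be combined with head and tail, assuming GNU coreutils sort (-S and --parallel are GNU extensions, not POSIX). The buffer size 64M, the thread count 2, and the sample_big* file names are arbitrary values picked for illustration, not tuned recommendations.

```shell
# Small stand-in for a large data file (hypothetical data).
printf '%s\n' 20170415 20280903 20010903 20131115 > sample_big.dat

n=2                                                    # use 10000 for the real file
# GNU sort only: -S sets the main-memory buffer size, --parallel the thread count.
sort -S 64M --parallel=2 sample_big.dat > sample_big_sorted
head -n "$n" sample_big_sorted > sample_big_oldest     # ascending sort: oldest first
tail -n "$n" sample_big_sorted > sample_big_newest     # newest at the end, ascending
```

Writing the sorted data to an intermediate file costs some disk space but keeps head and tail independent of each other.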
