Closest value in different files, with different numbers of lines and other conditions (bash, awk, other)

Question

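I need to revive this problem and adapt it to long files.

I have the ages of two stars in two files (File 1 and File 2). The stellar age is in column $1, and the remaining columns, up to $13, hold information I need to print at the end.

I am trying to find the age at which the two stars have the same age, or the closest one. Since the files are large (~25000 lines), I don't want to search the whole array, for speed reasons. Their line counts can also differ a lot (by ~10000 lines in some cases).

I'm not sure this is the best way to solve the problem, but lacking a better one, this is my idea. (If you have a faster, more efficient method, please share it.)

All values have 12 decimal places. For now I only care about the first column (where the age is).

I need several loops.

Let's take this value from File 1: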

2.326062371284e+05

First, the routine should search File 2 for all matches that contain

2.3260e+05

(This loop would probably search the whole array, but if there is a way to stop the search as soon as it reaches 2.3261, that would save some time.)
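One way to get that early stop, sketched here in Python under the assumption that File 2 is sorted by age in ascending order (as the sample data is): two binary searches give the bounds of the matching block directly, so nothing outside it is scanned. The helper name `rows_in_range` is made up for this illustration, and the prefix 2.3260 with exponent e+05 is treated as the numeric interval [232600, 232610).

```python
from bisect import bisect_left

def rows_in_range(sorted_ages, low, high):
    """Return the slice of sorted_ages with low <= age < high."""
    start = bisect_left(sorted_ages, low)   # first index with age >= low
    stop = bisect_left(sorted_ages, high)   # first index with age >= high
    return sorted_ages[start:stop]

# Ages from File 2 in the question (assumed to be in ascending order).
file2_ages = [
    2.210729203510e+05, 2.354895663228e+05, 2.499062122946e+05,
    2.643228582664e+05, 2.787395042382e+05, 2.921130362004e+05,
    3.054865681626e+05,
]

# All ages whose mantissa starts with 2.3260 (exponent e+05).
candidates = rows_in_range(file2_ages, 2.3260e+05, 2.3261e+05)
```

Narrowing to 2.32606 works the same way with tighter bounds, so no nested loop is needed for the refinement.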

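If only one is found, the output should be that value.

Usually several lines will be found, possibly as many as 1000. In that case, it should search again for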

2.32606e+05

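among the lines found before. (I think this is a nested loop.) The number of matches would then drop to ~200.

At that point, the routine should search for the best difference, within a certain tolerance X, between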

2.326062371284e+05

and all of those 200 lines.
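If File 2 is sorted by age, the staged narrowing described above can probably be replaced by a single binary search: the closest value is always one of the two neighbours of the insertion point. A minimal Python sketch of that idea follows; `nearest_index` is a hypothetical helper name, and the tolerance of 3000 is the one used for the sample output in this question.

```python
from bisect import bisect_left

def nearest_index(sorted_ages, target):
    """Index of the element of sorted_ages closest to target.

    On an exact tie the lower neighbour is returned.
    """
    if not sorted_ages:
        raise ValueError("sorted_ages must not be empty")
    pos = bisect_left(sorted_ages, target)
    if pos == 0:
        return 0
    if pos == len(sorted_ages):
        return len(sorted_ages) - 1
    before, after = sorted_ages[pos - 1], sorted_ages[pos]
    return pos - 1 if target - before <= after - target else pos

# Ages from File 2 in the question (assumed to be in ascending order).
file2_ages = [
    2.210729203510e+05, 2.354895663228e+05, 2.499062122946e+05,
    2.643228582664e+05, 2.787395042382e+05, 2.921130362004e+05,
    3.054865681626e+05,
]

target = 2.326062371284e+05
best = file2_ages[nearest_index(file2_ages, target)]
within_tolerance = abs(best - target) <= 3000
```

Each lookup costs O(log n), so even 25000 lookups against a 25000-line file should stay cheap.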

So, given these files

File 1

1.833800650355e+05 col2f1 col3f1 col4f1
1.959443501406e+05 col2f1 col3f1 col4f1
2.085086352458e+05 col2f1 col3f1 col4f1
2.210729203510e+05 col2f1 col3f1 col4f1
2.326062371284e+05 col2f1 col3f1 col4f1
2.441395539059e+05 col2f1 col3f1 col4f1
2.556728706833e+05 col2f1 col3f1 col4f1

File 2

2.210729203510e+05 col2f2 col3f2 col4f2
2.354895663228e+05 col2f2 col3f2 col4f2
2.499062122946e+05 col2f2 col3f2 col4f2
2.643228582664e+05 col2f2 col3f2 col4f2
2.787395042382e+05 col2f2 col3f2 col4f2
2.921130362004e+05 col2f2 col3f2 col4f2
3.054865681626e+05 col2f2 col3f2 col4f2

Output File 3 (tolerance 3000)

2.210729203510e+05 2.210729203510e+05 col2f1 col2f2 col4f1 col3f2
2.326062371284e+05 2.354895663228e+05 col2f1 col2f2 col4f1 col3f2

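Important condition:

The output must not contain repeated lines (star 1 cannot have one fixed age paired with several different ages of star 2; only the closest one).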
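As a rough sketch of these requirements taken together (closest match, tolerance, no repeated lines), here is one possible approach in Python. It assumes both files are sorted by age and walks them in a single pass with two positions instead of searching. The function name `match_closest` is invented for this example, and the column selection simply mirrors the sample output above.

```python
def match_closest(rows1, rows2, tolerance):
    """Pair each row of rows1 with its closest row in rows2 (by column 0).

    Both inputs must be sorted by age. Only pairs within `tolerance` are
    kept, and each row of rows2 is used at most once: if several rows of
    rows1 point at it, the closest one wins.
    """
    if not rows2:
        return []
    best = {}  # index in rows2 -> (difference, row1)
    j = 0
    for row1 in rows1:
        age1 = float(row1[0])
        # Move forward while the next row of rows2 is at least as close.
        while (j + 1 < len(rows2)
               and abs(float(rows2[j + 1][0]) - age1)
               <= abs(float(rows2[j][0]) - age1)):
            j += 1
        diff = abs(float(rows2[j][0]) - age1)
        if diff <= tolerance and (j not in best or diff < best[j][0]):
            best[j] = (diff, row1)
    return [(row1, rows2[j]) for j, (_, row1) in sorted(best.items())]

def parse(text):
    """Split whitespace-separated text into a list of rows."""
    return [line.split() for line in text.strip().splitlines()]

file1 = parse("""
1.833800650355e+05 col2f1 col3f1 col4f1
1.959443501406e+05 col2f1 col3f1 col4f1
2.085086352458e+05 col2f1 col3f1 col4f1
2.210729203510e+05 col2f1 col3f1 col4f1
2.326062371284e+05 col2f1 col3f1 col4f1
2.441395539059e+05 col2f1 col3f1 col4f1
2.556728706833e+05 col2f1 col3f1 col4f1
""")
file2 = parse("""
2.210729203510e+05 col2f2 col3f2 col4f2
2.354895663228e+05 col2f2 col3f2 col4f2
2.499062122946e+05 col2f2 col3f2 col4f2
2.643228582664e+05 col2f2 col3f2 col4f2
2.787395042382e+05 col2f2 col3f2 col4f2
2.921130362004e+05 col2f2 col3f2 col4f2
3.054865681626e+05 col2f2 col3f2 col4f2
""")

# Same column selection as the sample output: age1 age2 col2f1 col2f2 col4f1 col3f2
lines = [" ".join([r1[0], r2[0], r1[1], r2[1], r1[3], r2[2]])
         for r1, r2 in match_closest(file1, file2, tolerance=3000)]
print("\n".join(lines))
```

The ages are kept as the original strings for printing and only converted to floats for comparison, so the 12 decimal places survive unchanged.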

How would you solve this?

Thanks a lot!

PS: I have completely changed the question, because I was told that my reasoning had some errors. Thanks!

Answer 1

Perl to the rescue. This should be very fast, as it does a binary search within a given range.

#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

use List::Util qw{ max min };
use constant { SIZE      => 100,
               TOLERANCE => 3000,
           };


# Load file2 into memory (the binary search below assumes it is sorted by age).
my @times2;
open my $F2, '<', 'file2' or die $!;
while (<$F2>) {
    chomp;
    push @times2, $_;
}

my $num = 0;
open my $F1, '<', 'file1' or die $!;
while (my $time = <$F1>) {
    chomp $time;

    # Binary search in file2, limited to SIZE lines either side of line $num.
    my $from = max(0, $num - SIZE);
    my $to   = min($#times2, $num + SIZE);
    my $between;
    while (1) {
        $between = int(($from + $to) / 2);

        if ($time < $times2[$between] && $to != $between) {
            $to = $between;

        } elsif ($time > $times2[$between] && $from != $between) {
            $from = $between;

        } else {
            last
        }
    }
    $num++;
    # The search ended between two neighbours: keep the closer one.
    if ($from != $to) {
        my $f = $time - $times2[$from];
        my $t = $times2[$to] - $time;
        $between = ($f > $t) ? $to : $from;
    }
    # Print the pair only if the ages differ by at most TOLERANCE.
    say "$time $times2[$between]" if TOLERANCE >= abs $times2[$between] - $time;
}

Answer 2

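This is not an awk solution, and the other solutions are great too, so here is an answer using R.

A new answer with different data; this time, to make the example self-contained, nothing is read from files:

# Sample data for the code; use fread to read from a file and setnames to name the columns accordingly
library(data.table)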
set.seed(123)
data <- data.table(age=runif(20)*1e6,name=sample(state.name,20),sat=sample(mtcars$cyl,20),dens=sample(DNase$density,20))
data2 <- data.table(age=runif(10)*1e6,name=sample(state.name,10),sat=sample(mtcars$cyl,10),dens=sample(DNase$density,10))

setkey(data,'age') # Set the key for joining to the age column
setkey(data2,'age') # Set the key for joining to the age column

# get the result
result=data[ # To get all the data from file 1 and file 2 at the end
         data2[ 
           data, # Search for each star of list 1
           .SD, # return columns of file 2
           roll='nearest',by=.EACHI, # Join on each line (left join) and find nearest value
          .SDcols=c('age','name','dens')]
       ][!duplicated(age) & abs(i.age - age) < 1e3,.SD,.SDcols=c('age','i.age','name','i.name','dens','i.dens') ] # filter duplicates in first file and on difference
# Write results to a file (change the separator as you wish):
write.table(format(result,digits=15,scientific=TRUE),"c:/test.txt",sep=" ")

Code:

# A nice package to have, install.packages('data.table') if it's not present
library(data.table)
# Read the data (the text can be file names)
stars1 <- fread("1.833800650355e+05
1.959443501406e+05
2.085086352458e+05
2.210729203510e+05
2.326062371284e+05
2.441395539059e+05
2.556728706833e+05")

stars2 <- fread("2.210729203510e+05
2.354895663228e+05
2.499062122946e+05
2.643228582664e+05
2.787395042382e+05
2.921130362004e+05
3.054865681626e+05")

# Name the columns (not needed if the file has a header)
colnames(stars1) <- "age"
colnames(stars2) <- "age"

# Key the data tables (for a fast join with binary search later)
setkey(stars1,'age')
setkey(stars2,'age')

# Get the result (more details below on what is happening here :))
result=stars2[ stars1, age, roll="nearest", by=.EACHI]

# Rename the columns so we can filter the whole result
setnames(result,make.unique(names(result)))

# Final filter on difference
result[abs(age.1 - age) < 3e3]

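So the interesting part is the first 'join' on the two lists of star ages, which searches, for each age in stars1, the nearest one in stars2.

This gives (after the column renaming):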

> result
        age    age.1
1: 183380.1 221072.9
2: 195944.4 221072.9
3: 208508.6 221072.9
4: 221072.9 221072.9
5: 232606.2 235489.6
6: 244139.6 249906.2
7: 255672.9 249906.2

Now that we have the nearest one for each, filter for those that are close enough (here, an absolute difference below 3 000):

> result[abs(age.1 - age) < 3e3]
        age    age.1
1: 221072.9 221072.9
2: 232606.2 235489.6


bash

awk

sed

gawk