One nearest neighbour using awk

Question

This is what I am trying to do in AWK. My problem is mainly with step 2. I show a sample dataset here, but the original dataset has 100 fields and 2000 records.

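Algorithm

1) Initialize accuracy = 0

2) For each record r

     Find the closest other record, o, in the dataset using the distance formula

To find the nearest neighbour of r0, I need to compare r0 with r1 through r9 and do the math as follows: square(abs(r0.c1 - r1.c1)) + square(abs(r0.c2 - r1.c2)) + ... + square(abs(r0.c5 - r1.c5)), and store those distances.

3) Take the record with the minimum distance and compare its c6 value. If the c6 values are equal, increment accuracy by 1.

Repeat the process for all records.

4) Finally, get the percentage as (accuracy/total_records) * 100;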

Sample dataset

        c1   c2   c3   c4   c5   c6  --> Columns
  r0  0.19 0.33 0.02 0.90 0.12 0.17  --> row1 & row7 nearest neighbour in c1
  r1  0.34 0.47 0.29 0.32 0.20 1.00      and same values in c6(0.3) so ++accuracy
  r2  0.37 0.72 0.34 0.60 0.29 0.15 
  r3  0.43 0.39 0.40 0.39 0.32 0.27 
  r4  0.27 0.41 0.08 0.19 0.10 0.18 
  r5  0.48 0.27 0.68 0.23 0.41 0.25 
  r6  0.52 0.68 0.40 0.75 0.75 0.35 
  r7  0.55 0.59 0.61 0.56 0.74 0.76 
  r8  0.04 0.14 0.03 0.24 0.27 0.37 
  r9  0.39 0.07 0.07 0.08 0.08 0.89

Code

BEGIN   {
            #initialize accuracy and total_records
            accuracy = 0;
            total_records = 10;
        }


NR==FNR {    # Loop through each record and store it in an array
                for (i=1; i<=NF; i++) 
                {
                     records[i]=$i;
                }
            next             
        }

        {   # Re-Loop through the file and compare each record from the array with each record in a file    
              for(i=1; i <= length(records); i++)
              {
                   for (j=1; j<=NF; j++) 
                   {      # here I need to get the difference of each field of the record[i] with each all the records, square them and sum it up. 
                          distance[j] = (records[i] - $j)^2;
                   }
               #Once I have all the distance, I can simply compare the values of field_6 for the record with least distance.
              if(min(distance[j]))
              {
                  if(records[] == )
                  {
                        ++accuracy;
                  } 
              }
       }
END{
     percentage = 100 * (accuracy/total_records); 
     print percentage;
}

Answer 1

Here is one approach

$ cat -n file > nfile
$ join nfile{,} -j99 | 
  awk 'function abs(x) {return x>0?x:-x}  
           $1<$8 {minc=999;for(i=2;i<7;i++) 
                 {d=abs($i-$(i+7)); 
                  if(d<minc)minc=d} 
                  print $1,minc,$7==$14}' | 
  sort -u -k1,2 -k3r | 
  awk '!a[$1]++{sum+=$3} END{print sum}'

7

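Because of symmetry you only need to compare n*(n-1)/2 pairs of records. It is easier to set this up with join: prepare all the matches and filter out the redundant ones with $1<$8, find the minimum column distance for each pair and whether the last fields match with $7==$14, sort by the first record number and the distance to get the minimum distance for each record, and finally sum the matched entries.

For your formula, I guess the result would be 100*2*7/10=140%, since you are double counting (R1~R7 and R7~R1); otherwise 70%.

UPDATE
With the new distance function, the script can be rewritten as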

$ join nfile{,} -j999 | 
  awk '$1<$8 {d=0; 
              for(i=2;i<7;i++) d+=($i-$(i+7))^2; 
              print $1,d,$7==$14}' | 
  sort -k1,2n -k3r | 
  awk '!a[$1]++{sum+=$3;count++} 
            END{print 100*sum/(count+1)"%"}'

70%

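Explanation

cat -n file > nfile creates a new file with record numbers. join can't take both files from stdin, so you have to create a temporary file.

join nfile{,} -j999 produces the cross product of the records: since field 999 does not exist, every line has the same empty join key, so each record is joined with every record (similar in effect to two nested loops).

$1<$8 filters the records down to the upper triangular section of the cross product (if you imagine it as a 2D matrix).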
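To see the cross product and the triangular filter in isolation, here is a minimal sketch on a hypothetical three-record file (demo.txt and its contents are made up for illustration, and it assumes a join that treats a missing join field as an empty key, as GNU join does). Each record has two fields, so after numbering, the two record numbers land in $1 and $4 rather than $1 and $8.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'a x' 'b y' 'c z' > "$tmp/demo.txt"   # hypothetical 3-record file
cat -n "$tmp/demo.txt" > "$tmp/ndemo.txt"           # prefix each record with its number

# Field 999 does not exist, so all lines share the same empty key
# and join pairs every line with every line: 3 * 3 = 9 pairs.
pairs=$(join -j 999 "$tmp/ndemo.txt" "$tmp/ndemo.txt" | awk 'END { print NR }')

# Keep only pairs whose first record number is smaller than the second:
# 3 * 2 / 2 = 3 pairs remain.
upper=$(join -j 999 "$tmp/ndemo.txt" "$tmp/ndemo.txt" | awk '$1 < $4' | awk 'END { print NR }')

echo "$pairs $upper"
rm -r "$tmp"
```

With n records the first count is n*n and the second is n*(n-1)/2.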

for(i=2;i<7;i++) d+=($i-$(i+7))^2; calculates the squared distance between the two records on each joined line.
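As a quick check of the field offsets, the same loop can be run on a single joined line. The line below is typed in by hand as an assumption of what join would emit for r0 and r1 of the sample dataset, numbered 1 and 2: fields 2 to 6 hold c1..c5 of the first record and fields 9 to 13 hold c1..c5 of the second.

```shell
# One hand-built joined line: record 1 (r0) followed by record 2 (r1).
line='1 0.19 0.33 0.02 0.90 0.12 0.17 2 0.34 0.47 0.29 0.32 0.20 1.00'

# Sum the squared differences of c1..c5; c6 (fields 7 and 14) is not part of the distance.
d=$(echo "$line" | awk '{ d = 0; for (i = 2; i < 7; i++) d += ($i - $(i+7))^2; print d }')

echo "$d"   # 0.0225 + 0.0196 + 0.0729 + 0.3364 + 0.0064 = 0.4578
```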

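print $1,d,$7==$14 prints the from-record number, the squared distance, and an indicator of whether the last fields match.

sort -u -k1,2 -k3r finds the minimum for each record; the reverse sort on the third field puts a 1 first, if there is one.

!a[$1]++{sum+=$3;count++} counts the rows and sums the indicator, taking only the first line for each from-record.

END{print 100*sum/(count+1)"%"} converts to percent format. The number of records is one more than the count, because the $1<$8 filter means the last record never appears as a from-record.

I suggest running each piped section in stages and verifying the intermediate results to understand what is going on.

For your real data you have to change the hard-coded field references. The join field number should be greater than your number of fields.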
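If adjusting the hard-coded field numbers is inconvenient, one alternative is a single awk program that keeps all records in memory and follows the four steps of the question directly. This is only a sketch under assumptions, not the pipeline above: the file sample.txt and its four records are hypothetical, the last field is taken as the class value, and all other fields are used for the distance. It compares every record with every other record in both directions, so it does n*(n-1) distance computations.

```shell
tmp=$(mktemp -d)
# Hypothetical data: two numeric features followed by a class label.
cat > "$tmp/sample.txt" <<'EOF'
0.0 0.0 a
0.1 0.0 a
1.0 1.0 b
0.9 1.0 a
EOF

result=$(awk '
{   # store every field of every record as rec[row, column]
    for (i = 1; i <= NF; i++) rec[NR, i] = $i
    nf = NF
}
END {
    n = NR
    for (r = 1; r <= n; r++) {
        best = -1                        # index of the nearest other record
        for (o = 1; o <= n; o++) {
            if (o == r) continue         # skip the record itself
            d = 0
            for (i = 1; i < nf; i++)     # all fields except the last one
                d += (rec[r, i] - rec[o, i])^2
            if (best < 0 || d < min) { min = d; best = o }
        }
        # count a hit when the nearest neighbour has the same last field
        if (best > 0 && rec[r, nf] == rec[best, nf]) accuracy++
    }
    printf "%d%%\n", 100 * accuracy / n
}' "$tmp/sample.txt")

echo "$result"
rm -r "$tmp"
```

Here records 1 and 2 are each other's nearest neighbours and share label a, while records 3 and 4 are nearest to each other but have different labels, so the sketch prints 50%. For 2000 records with 100 fields this means roughly 4 million pairs times 99 fields, which awk should handle, though it holds the whole file in memory.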
