How to extract lines from two text files linked by the heading number in the first 10 characters?

Question

I have two files:

file1.txt:

0000001435 XYZ 与 ABC
0000001438warlaugh 世界

file2.txt:

0000001435 XYZ with abc
0000001436 DFC whatever
0000001437 FBFBBBF
0000001438 world of warlaugh

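The lines in the two files are linked by the number in the first 10 characters. The desired output is a tab-delimited file containing the lines that exist in file1.txt together with the corresponding lines from file2.txt: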

file3.txt:

XYZ 与 ABC	XYZ with abc
warlaugh 世界	world of warlaugh

How can I look up the corresponding lines and then write a tab-delimited file, using only the lines that exist in file1.txt, to produce file3.txt?

Note that only the first 10 characters make up the ID. There are cases such as 0000001438warlaugh 世界 or even 0000001432231hahaha lol, where only 0000001438 and 0000001432 are the IDs.
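To make the ID rule concrete, here is a minimal Python sketch that splits a line by position rather than by whitespace. The helper name split_record and its id_width parameter are hypothetical, introduced only for illustration.

```python
def split_record(line, id_width=10):
    """Return (id, text): the ID is always the first id_width characters."""
    line = line.rstrip("\r\n")
    # Slice by position, so digits that follow the ID stay in the text part.
    return line[:id_width], line[id_width:].strip()


print(split_record("0000001438warlaugh 世界"))   # ('0000001438', 'warlaugh 世界')
print(split_record("0000001432231hahaha lol"))   # ('0000001432', '231hahaha lol')
```

Splitting on the first space would fail on both of these lines, which is why a fixed-width slice is used here.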

I tried Python, getfile3.py:

import io

# Map each 10-character ID to the rest of its line.
f1 = {line[:10]: line[10:].strip() for line in io.open('file1.txt', 'r', encoding='utf8')}
f2 = {line[:10]: line[10:].strip() for line in io.open('file2.txt', 'r', encoding='utf8')}

with io.open('file3.txt', 'w', encoding='utf8') as f3:
    for i in f1:
        if i in f2:  # skip IDs missing from file2.txt
            f3.write(u"{}\t{}\n".format(f1[i], f2[i]))

But is there a bash/awk/grep/perl command-line way to get file3.txt?

Answer 1

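awk '
# Split every line into the 10-character ID and the remaining text.
{ key = substr($0,1,10); data = substr($0,11); sub(/^[ \t]+/, "", data) }
# First file: remember the text for each ID.
NR==FNR { file1[key] = data; next }
# Second file: print a tab-separated pair for IDs seen in the first file.
key in file1 { print file1[key] "\t" data }
' file1.txt file2.txt > file3.txt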

If you prefer, you can use FIELDWIDTHS with GNU awk instead of substr().

Answer 2

A very long Perl answer:

use warnings;
use strict;

# add files here as needed
my @input_files = qw(file1.txt file2.txt);
my $output_file = 'output.txt';

# don't touch anything below this line
my @output_lines = parse_files(@input_files);

open (my $output_fh, ">", $output_file) or die;
foreach (@output_lines) {
    print $output_fh "$_\n";                    #print to output file
    print "$_\n";                               #print to console
}
close $output_fh;

sub parse_files {
    my @input_files = @_;                       #list of text files to read.
    my %data;                                   #will store $data{$index} = datum1 datum2 datum3

    foreach my $file (@input_files) {           
        open (my $fh, "<", $file) or die;       
        while (<$fh>) { 
            chomp;                              
            if (/^(\d{10})\s?(.*)$/) {
                my $index = $1;                 #first capture group: the 10-digit ID
                my $datum = $2;                 #second capture group: the rest of the line
                if (exists $data{$index}) {
                    $data{$index} .= "\t$datum";
                } else {
                    $data{$index} = $datum;
                } #/else
            } #/if regex found
        } #/while reading current file
        close $fh;
    } #/foreach file

    # Create output array
    my @output_lines;
    foreach my $key (sort keys %data) {
        push (@output_lines, "$data{$key}");
    } #/foreach

    return @output_lines;
} #/sub parse_files
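
Note that the script above emits a line for every ID found in any input file, so an ID that occurs only in file2.txt (such as 0000001436) still shows up on its own. If only the IDs present in file1.txt are wanted, as the question asks, the join could be restricted along these lines. This is a minimal Python sketch, not part of the answer above; the function name join_by_id and the in-memory sample lists are assumptions made for illustration.

```python
def join_by_id(first_lines, second_lines, id_width=10):
    """Yield 'text1<TAB>text2' for each ID in first_lines that also occurs in second_lines."""
    # Index the second input by its fixed-width ID.
    second = {}
    for line in second_lines:
        line = line.rstrip("\r\n")
        second[line[:id_width]] = line[id_width:].strip()

    # Walk the first input in its original order and keep only matching IDs.
    for line in first_lines:
        line = line.rstrip("\r\n")
        key, text = line[:id_width], line[id_width:].strip()
        if key in second:
            yield text + "\t" + second[key]


# Sample data copied from the question (stand-ins for the real files).
file1 = ["0000001435 XYZ 与 ABC\n", "0000001438warlaugh 世界\n"]
file2 = [
    "0000001435 XYZ with abc\n",
    "0000001436 DFC whatever\n",
    "0000001437 FBFBBBF\n",
    "0000001438 world of warlaugh\n",
]

for row in join_by_id(file1, file2):
    print(row)
```

Because the function accepts any iterable of lines, open file objects can be passed in place of the sample lists.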
