How to take only the latest unique record based on a given column
I am writing a script in Perl, but I am stuck on one part. Below is a sample of my CSV file.
"MP","918120197922","20150806125001","prepaid","prepaid","3G","2G"
"GJ","919904303790","20150806125002","prepaid","prepaid","2G","3G"
"MH","919921990805","20150806125003","prepaid","prepaid","2G"
"MP","918120197922","20150806125004","prepaid","prepaid","2G"
"MUM","919904303790","20150806125005","prepaid","prepaid","2G","3G"
"MUM","918652624178","20150806125005","","prepaid","","2G","NEW"
"MP","918120197922","20150806125005","prepaid","prepaid","2G","3G"
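Now I need unique records based on the 2nd column (the mobile number), keeping only the record with the latest value in the 3rd column (the timestamp).
For example, for mobile number "918120197922":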
"MP","918120197922","20150806125001","prepaid","prepaid","3G","2G"
"MP","918120197922","20150806125004","prepaid","prepaid","2G"
"MP","918120197922","20150806125005","prepaid","prepaid","2G","3G"
It should select the third record, because it has the latest timestamp value (20150806125005). Please help.
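Additional information:
Sorry for the inconsistency in the data; I have corrected it now.
Yes, the data is ordered, meaning the latest timestamp appears in the last row.
One more thing: my file is over 1 GB, so is there a way to do this efficiently? Would awk work faster than Perl in this case? Please help.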
Use Text::CSV to process CSV files.
Hash the rows by the 2nd column, keeping only the most recent row for each number.
#!/usr/bin/perl
use warnings;
use strict;

use Text::CSV;

my $csv = 'Text::CSV'->new() or die 'Text::CSV'->error_diag;

my %hash;
open my $CSV, '<', '1.csv' or die $!;
while (my $row = $csv->getline($CSV)) {
    my ($number, $timestamp) = @$row[1, 2];
    # Store the row if the timestamp is more recent than the stored one.
    $hash{$number} = $row if $timestamp gt ($hash{$number}[2] || q());
}

# Quote every field and end each record with a newline on output.
$csv->eol("\n");
$csv->always_quote(1);
open my $OUT, '>', 'uniq.csv' or die $!;
# Note: values %hash returns the rows in no particular order.
for my $row (values %hash) {
    $csv->print($OUT, $row);
}
close $OUT or die $!;
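To make the hashing idea easier to follow, here is a minimal sketch that runs the same logic on the sample rows held in memory, so no CSV module or input file is needed. The helper name latest_per_key is made up for illustration. Two things differ from the script above and are my assumptions, not part of the original answer: it compares with ge so that a later row with an equal timestamp wins (in line with the asker's note that later rows are newer), and it remembers the order in which keys first appear so the output order is predictable.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: for each key, keep the row with the greatest timestamp.
# $rows is an array ref of array refs; $key_idx and $ts_idx are column indexes.
sub latest_per_key {
    my ($rows, $key_idx, $ts_idx) = @_;
    my (%latest, @order);
    for my $row (@$rows) {
        my $key = $row->[$key_idx];
        if (!exists $latest{$key}) {
            push @order, $key;        # remember first-seen order of keys
            $latest{$key} = $row;
        }
        elsif ($row->[$ts_idx] ge $latest{$key}[$ts_idx]) {
            $latest{$key} = $row;     # newer (or equally new but later) row wins
        }
    }
    return [ map { $latest{$_} } @order ];
}

# The sample rows from the question, already split into fields.
my @rows = (
    ['MP',  '918120197922', '20150806125001', 'prepaid', 'prepaid', '3G', '2G'],
    ['GJ',  '919904303790', '20150806125002', 'prepaid', 'prepaid', '2G', '3G'],
    ['MH',  '919921990805', '20150806125003', 'prepaid', 'prepaid', '2G'],
    ['MP',  '918120197922', '20150806125004', 'prepaid', 'prepaid', '2G'],
    ['MUM', '919904303790', '20150806125005', 'prepaid', 'prepaid', '2G', '3G'],
    ['MUM', '918652624178', '20150806125005', '', 'prepaid', '', '2G', 'NEW'],
    ['MP',  '918120197922', '20150806125005', 'prepaid', 'prepaid', '2G', '3G'],
);

my $result = latest_per_key(\@rows, 1, 2);
print join(',', @$_), "\n" for @$result;
```

With the sample data this prints four rows, one per phone number, and the row kept for 918120197922 is the one with timestamp 20150806125005. The timestamps are fixed-width digit strings, so string comparison orders them correctly.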
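If you know your data is sorted by timestamp, you can take advantage of that: read it backwards, and the task becomes printing the first occurrence of each phone number.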
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;

use Text::CSV_XS;

use constant PHONENUM_FIELD => 1;

my $filename = shift;
die "Usage: $0 <filename>\n" unless defined $filename;

# Read the file in reverse line order through tac.
open my $in, '-|', 'tac', $filename;

my $csv = Text::CSV_XS->new( { binary => 1, auto_diag => 1, eol => $/ } );

my %seen;
while ( my $row = $csv->getline($in) ) {
    # Print a row only the first time its phone number is seen.
    $csv->print( *STDOUT, $row ) unless $seen{ $row->[PHONENUM_FIELD] }++;
}
If you want the output in the same order as the input, you can also write to tac:
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;

use Text::CSV_XS;

use constant PHONENUM_FIELD => 1;

my $filename = shift;
die "Usage: $0 <filename>\n" unless defined $filename;

# Read the file in reverse line order, and reverse the output back again.
open my $in,  '-|', 'tac', $filename;
open my $out, '|-', 'tac';

my $csv = Text::CSV_XS->new( { binary => 1, auto_diag => 1, eol => $/ } );

my %seen;
while ( my $row = $csv->getline($in) ) {
    # Print a row only the first time its phone number is seen.
    $csv->print( $out, $row ) unless $seen{ $row->[PHONENUM_FIELD] }++;
}
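If tac is not available, the same "last occurrence wins" result can be had in a single forward pass, at the cost of holding one line per distinct phone number in memory. The following is only a sketch of that alternative, not the method from the answer above: the helper name last_line_per_phone is made up, and the naive split on commas assumes no field contains an embedded comma or newline (true for the sample data, but something Text::CSV_XS would handle properly). It reads the sample from an in-memory filehandle so it can be run as is.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the last line seen for each phone number,
# ordered by the position of that last occurrence in the input.
sub last_line_per_phone {
    my ($fh) = @_;
    my (%last, %pos);
    my $n = 0;
    while (my $line = <$fh>) {
        $n++;
        # Naive field split: assumes no commas or newlines inside quoted fields.
        my $phone = (split /,/, $line)[1];
        next unless defined $phone;
        $last{$phone} = $line;   # later lines overwrite earlier ones
        $pos{$phone}  = $n;      # line number of the kept line
    }
    return map { $last{$_} } sort { $pos{$a} <=> $pos{$b} } keys %last;
}

my $sample = <<'CSV';
"MP","918120197922","20150806125001","prepaid","prepaid","3G","2G"
"GJ","919904303790","20150806125002","prepaid","prepaid","2G","3G"
"MH","919921990805","20150806125003","prepaid","prepaid","2G"
"MP","918120197922","20150806125004","prepaid","prepaid","2G"
"MUM","919904303790","20150806125005","prepaid","prepaid","2G","3G"
"MUM","918652624178","20150806125005","","prepaid","","2G","NEW"
"MP","918120197922","20150806125005","prepaid","prepaid","2G","3G"
CSV

open my $fh, '<', \$sample or die $!;
my @kept = last_line_per_phone($fh);
close $fh;
print @kept;
```

On the sample this keeps four lines, ending with the 20150806125005 record for 918120197922. Like the tac-based scripts, it relies on the input already being sorted by timestamp, since it never compares timestamps itself.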
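1 GB should not be a problem on any decent hardware. On my old laptop, processing 29360128 rows and 1.8 GB took 2m3.393s. That is over 230 krows/s, but YMMV. If you want all values quoted in the output, add always_quote => 1 to the $csv constructor arguments.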
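As a quick illustration of what that option changes, here is a small sketch that formats one of the sample rows with and without always_quote. It assumes Text::CSV_XS is installed; the variable names are mine.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use Text::CSV_XS;

my @fields = ('MH', '919921990805', '20150806125003', 'prepaid', 'prepaid', '2G');

# Default behaviour: fields are quoted only when they need to be.
my $plain  = Text::CSV_XS->new( { binary => 1, auto_diag => 1 } );
# With always_quote, every field is wrapped in quotes.
my $quoted = Text::CSV_XS->new( { binary => 1, auto_diag => 1, always_quote => 1 } );

$plain->combine(@fields)  or die "combine failed";
$quoted->combine(@fields) or die "combine failed";

print $plain->string,  "\n";
print $quoted->string, "\n";
```

The first line comes out unquoted and the second with every field quoted, which matches the format of the input file in the question.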