Count unique values in each column of a pipe-delimited text file using Perl
I have two sample pipe-delimited files in my home directory on a Solaris server, as follows:
file1.txt:
ticker|sedol|cusip|exchnage
ibm |ibm_1|ib |london
hcl | |hcl_02|london
hp |hp_01|hpm |newyork
|lkp |lk_2 |newyork
file2.txt:
exchnage|ticker|sedol|cusip
london |goo |goo_1|gm
london |hcl | |hcl_02
newyork |hp |hp_01|hpm
newyork |tyl | |ty_2
I need a result file that gives, for each exchange, the number of unique ticker, sedol and cusip values.
The expected result file looks like this:
exchnage|ticker|sedol|cusip
london |3 |2 |3
newyork |3 |2 |2
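I know this would be easy with SQL, but unfortunately a database is not an option. Each file can be as large as 300-400 MB. We would prefer to do this in Perl, or in Python if that proves difficult. The main environment is Solaris, but we can also try it on a Unix server. Added later: the "exchange" column (spelled "exchnage" in the file headers) can be at any position in either file.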
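I think you first need to build a list of dictionaries from the file, for example:
{'exchnage':'london','ticker':'ibm','sedol':'ibm_1','cusip':'ib'}. Then you need a for-each loop that adds each value to a list, but only if the value is not None and is not already in the list. That leaves you with all the unique values in the list, and you only have to count them.
You need to do this for every column in the file.
After that, write the result to a file.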
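The steps just described can be sketched in Python roughly as follows. This is only an illustration under some assumptions: the helper name `count_unique` is made up here, `csv.DictReader` is used to produce the per-row dictionaries, and a set replaces the "list plus membership check" so that duplicates are dropped automatically. The input is a small in-memory sample rather than the real files.

```python
import csv
import io
from collections import defaultdict


def count_unique(rows, group_key="exchnage"):
    """Return {exchange: {column: number of distinct non-empty values}}.

    `rows` is any iterable of dicts, one per data line (hypothetical helper).
    """
    seen = defaultdict(lambda: defaultdict(set))
    for row in rows:
        # Strip the padding around both column names and values.
        clean = {k.strip(): (v or "").strip()
                 for k, v in row.items() if k is not None}
        exchange = clean.pop(group_key)
        for column, value in clean.items():
            if value:  # ignore empty cells
                seen[exchange][column].add(value)
    return {ex: {col: len(vals) for col, vals in cols.items()}
            for ex, cols in seen.items()}


# Small sample in the same layout as file1.txt.
sample = io.StringIO(
    "ticker|sedol|cusip|exchnage\n"
    "ibm   |ibm_1|ib    |london\n"
    "hcl   |     |hcl_02|london\n"
    "hp    |hp_01|hpm   |newyork\n"
)
counts = count_unique(csv.DictReader(sample, delimiter="|"))
print(counts)
```

Because the rows are consumed one at a time, the same function could be fed a reader over an open file without loading the whole 300-400 MB into memory; only the sets of distinct values are kept.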
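#!/usr/bin/perl
use strict;
use warnings;

# One hash per counted column: $count{$exchange}{$value} = times seen.
my %unique_ticker_count;
my %unique_sedol_count;
my %unique_cusip_count;

# Skip the header line. When splitting, the pattern has to be /\|/:
# a quoted '|' is still compiled as a regex (an empty alternation)
# and would split the line into single characters.
my $header = <DATA>;

while ( my $line = <DATA> ) {
    chomp $line;

    # Limit -1 keeps a trailing empty field, so $cusip is not undef.
    my ( $exchange, $ticker, $sedol, $cusip ) = split( /\|/, $line, -1 );
    next unless defined $cusip;    # skip blank or short lines

    # Strip padding so that "hp" and "hp " count as the same value.
    # $exchange keeps its padding so the output lines up like the input.
    s/^\s+|\s+$//g for $ticker, $sedol, $cusip;

    if ( $ticker =~ m/\w/ ) { $unique_ticker_count{$exchange}{$ticker}++; }
    if ( $sedol  =~ m/\w/ ) { $unique_sedol_count{$exchange}{$sedol}++; }
    if ( $cusip  =~ m/\w/ ) { $unique_cusip_count{$exchange}{$cusip}++; }
}

# The number of keys in each inner hash is the number of unique values.
foreach my $exchange ( keys %unique_ticker_count ) {
    print join( "|",
        $exchange,
        scalar keys %{ $unique_ticker_count{$exchange} },
        scalar keys %{ $unique_sedol_count{$exchange} || {} },
        scalar keys %{ $unique_cusip_count{$exchange} || {} },
    ), "\n";
}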
__DATA__
exchnage|ticker|sedol|cusip
london |ibm |ibm_1|ib
london |hcl | |hcl_02
newyork |hp |hp_01|hpm
newyork |lkp |lk_2 |
london |goo |goo_1|gm
london |hcl | |hcl_02
newyork |hp |hp_01|hpm
newyork |tyl | |ty_2
This prints (hash key order is not guaranteed, so the lines may come out in either order):
newyork |3|2|2
london |3|2|3
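I'll leave opening and reading the files to you; this is just an example of an approach you could take.
Not the most elegant, but the fastest thing I could put together for your new requirement: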
import glob
import os
import sys

path = "/tmp"
file_mask = "file*.txt"
results = {}       # exchange -> {column name -> set of distinct values}
column_names = {}  # column index -> column name (exchange column excluded)

for file_name in glob.glob(os.path.join(path, file_mask)):
    column_names = {}
    exchange_col = None
    with open(file_name, "r") as f:
        for line_num, line in enumerate(f):
            line_parsed = line.strip().split("|")
            # process header
            if not line_num:
                for column_num, column in enumerate(line_parsed):
                    if column.strip() == "exchnage":
                        exchange_col = column_num
                    else:
                        column_names[column_num] = column.strip()
                if exchange_col is None:
                    print("Can't find exchnage field")
                    sys.exit(1)
                continue
            # skip lines with the wrong number of fields
            if len(line_parsed) != len(column_names) + 1:
                continue
            exchange = line_parsed[exchange_col].strip()
            # prepare empty structure for exchange, if not added yet
            if exchange not in results:
                results[exchange] = {name: set() for name in column_names.values()}
            # add unique items to exchange
            for column_num, column in enumerate(line_parsed):
                column_val = column.strip()
                # add only non-empty values
                if column_val and column_num != exchange_col:
                    results[exchange][column_names[column_num]].add(column_val)

# report the columns in the order of the last file read
column_names = list(column_names.values())
print("exchnage|" + "|".join("%8s" % c for c in column_names))
for exchange, values in results.items():
    print("%8s|" % exchange + "|".join("%8s" % len(values[column]) for column in column_names))
Program output (the input was the new files, with their differing column order):
$ python parser.py
exchnage| ticker| sedol| cusip
newyork| 2| 2| 3
london| 3| 2| 3
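Since the question asks for a result file rather than console output, the last step could be sketched like this. It is only an illustration: `write_report` is a hypothetical helper, the `results` dictionary is typed in by hand from the two sample files in the question, and exchanges are sorted so the output order is stable.

```python
import io


def write_report(results, columns, out, group_name="exchnage"):
    """Write one pipe-delimited line per exchange to the file-like `out`."""
    out.write("|".join([group_name] + list(columns)) + "\n")
    for exchange in sorted(results):
        # A column with no values for this exchange counts as 0.
        counts = [str(len(results[exchange].get(col, ()))) for col in columns]
        out.write("|".join([exchange] + counts) + "\n")


# exchange -> {column -> set of distinct values}, as built by the parser.
results = {
    "newyork": {"ticker": {"hp", "tyl"},
                "sedol": {"hp_01", "lkp"},
                "cusip": {"hpm", "lk_2", "ty_2"}},
    "london": {"ticker": {"goo", "hcl", "ibm"},
               "sedol": {"goo_1", "ibm_1"},
               "cusip": {"gm", "hcl_02", "ib"}},
}

buf = io.StringIO()
write_report(results, ["ticker", "sedol", "cusip"], buf)
print(buf.getvalue(), end="")
```

Passing an open file (for example one opened with `open("result.txt", "w")`, a name chosen here only for illustration) instead of the `io.StringIO` buffer would produce the result file directly.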