Perl:处理大文件并存储到数据库中
Perl: Processing large files and storing into database
我的任务是修改一个 Perl 脚本,该脚本读取 12 个文件(每个约 1GB,约 400 万 entries/file)。问题是我不知道任何 perl。但是,我能够为我的用例成功修改脚本。问题是脚本需要大量时间来处理文件以及将条目插入数据库。如果 pointers/suggestions 减少时间,我们将不胜感激。
来自其中一个文件的示例条目行(修改以保护身份)如下:
10.0.0.25 [06/Aug/2015:06:00:02 +0000] "0.002" "200" "0.002" "172.16.2.57:7777" "-" "GET /txt/AKXBPPYICZIBGM/n1/19757_705326?dc=us2&ext_user_id=1400587512&si=149042592 HTTP/1.1" "http://xyz.xyz/site_view.xhtml?cmid=22211051&get-title=Test%20-%20Test%20Test%2010&Test=Test%20Test"
脚本(混合了一些伪代码)如下:
for ( $i = 0 ; $i < 12 ; $i++ ) {
$img_log_path = "Set appropriate path according to the iteration";
open(IMG_HANDLE, $img_log_path) or "Could not open the file [=11=]\n";
while ( $line = <IMG_HANDLE> ) {
my $impression_date = `date --date="yesterday" +%Y-%m-%d`;
chomp $line;
if ( $line =~ m/&si=(\d+)/ig ) {
$affid = ; # Extract the affiliate ID from $line
@fields = split(/"/, $line);
$img_ref_url = $fields[13]; # Extract the impression URL from $line
$impression_urls{$affid}{$img_ref_url}{IMPRESSION_COUNT}++; # Store the beacon impression URLs along with a count of number of impressions
$img_ref_url =~ m/http:\/\/(.*?)\//i;
$impression_urls{$affid}{$img_ref_url}{IMPRESSION_URL_HOST} = ; # Store the hostname of the $img_ref_url
# Store the time at which first impression occurs for this URL.
$line =~ m/(\d\d:\d\d:\d\d)\s/ig;
my $impression_url_time = ;
my $impression_time = split( / /,$impression_urls{$affid}{$img_ref_url}{IMPRESSION_TIME} ); # If IMPRESSION_TIME exists for this URL from a previous impression then extract the time
# Store the time for the first impression
if($impression_url_time lt $impression_time || !defined($impression_urls{$affid}{$img_ref_url}{IMPRESSION_TIME})) {
$impression_urls{$affid}{$ref}{IMPRESSION_TIME} = "$impression_date "."$impression_url_time"; # Store the time at which the first impression happened for this URL.
}
if ( (defined($affid) && $affid ne "") && (defined($img_ref_url) && $img_ref_url ne "") ) {
$affiliates{$affid}{TOTAL_IMP}++; # Increment the total number of impressions for this URL.
if ( &CheckURL($img_ref_url) ) {
$affiliates{$affid}{ADULT_IMP}++;
}
}
}
}
close IMG_HANDLE;
}
for $aff_id (@aff_ids) { #@aff_ids contains all the affiliate ids from a database
my $impression_table = "tns_impressions";
foreach $impression_url (keys %{ $impression_urls{$aff_id} }) {
# If the pageurl doesn't exist then don't add it into the database
if ($impression_url =~ /-/) {
next;
}
# Insert the URL into the database
my $query = "INSERT INTO $impression_table (affiliate_id, impression_url, created_at, no_of_impressions, hostname) VALUES (?,?,?,?,?)";
my $statement = $dbh->prepare($query)
or print STDERR "$dbh->errstr";
$statement->execute(
$aff_id,
$impression_url,
$impression_urls{$aff_id}{$impression_url}{IMPRESSION_TIME},
$impression_urls{$aff_id}{$impression_url}{IMPRESSION_COUNT},
$impression_urls{$aff_id}{$impression_url}{IMPRESSION_URL_HOST}
) or print STDERR "$statement->errstr";
}
}
仅供参考,它的外观如何(未经测试):
for ( $i = 0; $i < 12; $i++ ) {
$img_log_path = "Set appropriate path according to the iteration";
open( my $fh, '<', $img_log_path )
or "[=10=]: Could not open the file $img_log_path: $!";
while ( $line = <$fh> ) {
if (my ( $impression_url_time, $affid, $ref_url, $host )
= $line =~ m,
\A \S+ \s+ # IP
\[ [^:]+ : ( \d\d:\d\d:\d\d ) \s [^\]] \] \s+ # date and time
(?: " [^"]* ){10} # use same skip as original split
" \S+ \s+ \S+? \& si= (\d+) [^"]+ " \s+ # affid from HTTP header line
" ( http:// ( [^/]+ ) / [^"]+ ) " # referential url and host
,x
)
{
my $rec = $impression_urls{$affid}{$ref_url} //= {
IMPRESSION_TIME => $impression_time,
IMPRESSION_URL_HOST => $host
};
$rec->{IMPRESSION_COUNT}++;
$rec->{IMPRESSION_TIME} = $impression_time
unless $rec->{IMPRESSION_TIME} le $impression_time;
$affiliates{$affid}{TOTAL_IMP}++;
$affiliates{$affid}{ADULT_IMP}++ if &CheckURL($ref_url);
}
}
}
my $date = `date --date="yesterday" +%Y-%m-%d`;
my $statement = do {
my $impression_table = "tns_impressions";
my $query
= "INSERT INTO `$impression_table` (affiliate_id, impression_url, created_at, no_of_impressions, hostname) VALUES (?,?,?,?,?)";
$dbh->prepare($query)
or print STDERR $dbh->errstr;
};
for $aff_id (@aff_ids)
{ #@aff_ids contains all the affiliate ids from a database
foreach my $impression_url ( keys %{ $impression_urls{$aff_id} } ) {
my $rec = $impression_urls{$aff_id}{$impression_url};
# Insert the URL into the database
$statement->execute(
$aff_id, $impression_url,
"$date $rec->{IMPRESSION_TIME}",
$rec->{IMPRESSION_COUNT},
$rec->{IMPRESSION_URL_HOST}
) or print STDERR $statement->errstr;
}
}
我的任务是修改一个 Perl 脚本,该脚本读取 12 个文件(每个约 1GB,约 400 万 entries/file)。问题是我不知道任何 perl。但是,我能够为我的用例成功修改脚本。问题是脚本需要大量时间来处理文件以及将条目插入数据库。如果 pointers/suggestions 减少时间,我们将不胜感激。
来自其中一个文件的示例条目行(修改以保护身份)如下:
10.0.0.25 [06/Aug/2015:06:00:02 +0000] "0.002" "200" "0.002" "172.16.2.57:7777" "-" "GET /txt/AKXBPPYICZIBGM/n1/19757_705326?dc=us2&ext_user_id=1400587512&si=149042592 HTTP/1.1" "http://xyz.xyz/site_view.xhtml?cmid=22211051&get-title=Test%20-%20Test%20Test%2010&Test=Test%20Test"
脚本(混合了一些伪代码)如下:
for ( $i = 0 ; $i < 12 ; $i++ ) {
$img_log_path = "Set appropriate path according to the iteration";
open(IMG_HANDLE, $img_log_path) or "Could not open the file [=11=]\n";
while ( $line = <IMG_HANDLE> ) {
my $impression_date = `date --date="yesterday" +%Y-%m-%d`;
chomp $line;
if ( $line =~ m/&si=(\d+)/ig ) {
$affid = ; # Extract the affiliate ID from $line
@fields = split(/"/, $line);
$img_ref_url = $fields[13]; # Extract the impression URL from $line
$impression_urls{$affid}{$img_ref_url}{IMPRESSION_COUNT}++; # Store the beacon impression URLs along with a count of number of impressions
$img_ref_url =~ m/http:\/\/(.*?)\//i;
$impression_urls{$affid}{$img_ref_url}{IMPRESSION_URL_HOST} = ; # Store the hostname of the $img_ref_url
# Store the time at which first impression occurs for this URL.
$line =~ m/(\d\d:\d\d:\d\d)\s/ig;
my $impression_url_time = ;
my $impression_time = split( / /,$impression_urls{$affid}{$img_ref_url}{IMPRESSION_TIME} ); # If IMPRESSION_TIME exists for this URL from a previous impression then extract the time
# Store the time for the first impression
if($impression_url_time lt $impression_time || !defined($impression_urls{$affid}{$img_ref_url}{IMPRESSION_TIME})) {
$impression_urls{$affid}{$ref}{IMPRESSION_TIME} = "$impression_date "."$impression_url_time"; # Store the time at which the first impression happened for this URL.
}
if ( (defined($affid) && $affid ne "") && (defined($img_ref_url) && $img_ref_url ne "") ) {
$affiliates{$affid}{TOTAL_IMP}++; # Increment the total number of impressions for this URL.
if ( &CheckURL($img_ref_url) ) {
$affiliates{$affid}{ADULT_IMP}++;
}
}
}
}
close IMG_HANDLE;
}
for $aff_id (@aff_ids) { #@aff_ids contains all the affiliate ids from a database
my $impression_table = "tns_impressions";
foreach $impression_url (keys %{ $impression_urls{$aff_id} }) {
# If the pageurl doesn't exist then don't add it into the database
if ($impression_url =~ /-/) {
next;
}
# Insert the URL into the database
my $query = "INSERT INTO $impression_table (affiliate_id, impression_url, created_at, no_of_impressions, hostname) VALUES (?,?,?,?,?)";
my $statement = $dbh->prepare($query)
or print STDERR "$dbh->errstr";
$statement->execute(
$aff_id,
$impression_url,
$impression_urls{$aff_id}{$impression_url}{IMPRESSION_TIME},
$impression_urls{$aff_id}{$impression_url}{IMPRESSION_COUNT},
$impression_urls{$aff_id}{$impression_url}{IMPRESSION_URL_HOST}
) or print STDERR "$statement->errstr";
}
}
仅供参考,它的外观如何(未经测试):
for ( $i = 0; $i < 12; $i++ ) {
$img_log_path = "Set appropriate path according to the iteration";
open( my $fh, '<', $img_log_path )
or "[=10=]: Could not open the file $img_log_path: $!";
while ( $line = <$fh> ) {
if (my ( $impression_url_time, $affid, $ref_url, $host )
= $line =~ m,
\A \S+ \s+ # IP
\[ [^:]+ : ( \d\d:\d\d:\d\d ) \s [^\]] \] \s+ # date and time
(?: " [^"]* ){10} # use same skip as original split
" \S+ \s+ \S+? \& si= (\d+) [^"]+ " \s+ # affid from HTTP header line
" ( http:// ( [^/]+ ) / [^"]+ ) " # referential url and host
,x
)
{
my $rec = $impression_urls{$affid}{$ref_url} //= {
IMPRESSION_TIME => $impression_time,
IMPRESSION_URL_HOST => $host
};
$rec->{IMPRESSION_COUNT}++;
$rec->{IMPRESSION_TIME} = $impression_time
unless $rec->{IMPRESSION_TIME} le $impression_time;
$affiliates{$affid}{TOTAL_IMP}++;
$affiliates{$affid}{ADULT_IMP}++ if &CheckURL($ref_url);
}
}
}
my $date = `date --date="yesterday" +%Y-%m-%d`;
my $statement = do {
my $impression_table = "tns_impressions";
my $query
= "INSERT INTO `$impression_table` (affiliate_id, impression_url, created_at, no_of_impressions, hostname) VALUES (?,?,?,?,?)";
$dbh->prepare($query)
or print STDERR $dbh->errstr;
};
for $aff_id (@aff_ids)
{ #@aff_ids contains all the affiliate ids from a database
foreach my $impression_url ( keys %{ $impression_urls{$aff_id} } ) {
my $rec = $impression_urls{$aff_id}{$impression_url};
# Insert the URL into the database
$statement->execute(
$aff_id, $impression_url,
"$date $rec->{IMPRESSION_TIME}",
$rec->{IMPRESSION_COUNT},
$rec->{IMPRESSION_URL_HOST}
) or print STDERR $statement->errstr;
}
}