Memory leak when performing repeated queries using WWW::Mechanize

Question

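I am trying to track down a memory leak in my program. I have found the source of the leak, but I cannot fix it.

The program reads every gene page linked from each chromosome listed on Wikipedia under "Genes by human chromosome".

On each gene page the program extracts the information I am interested in, then moves on to the next gene page, and so on.

Once it reaches the end of the gene list for the current chromosome, it moves on to the next chromosome, until it has gone through every page.

About 2-3 weeks ago the program ran fine on my computer. The problem has been occurring ever since.

I have been monitoring with top: memory usage climbs steadily as the program runs, until it reaches a critical point and my computer crashes.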
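As a complement to watching top, memory use can also be logged from inside the script at chosen points in the loop. The following is a minimal sketch that assumes a Linux-style /proc filesystem (as on Ubuntu); `current_rss_kb` is a hypothetical helper name, and the array of strings is only a stand-in for real work.

```perl
use strict;
use warnings;

# Return this process's resident set size in kB, or undef if it cannot
# be read. Assumes Linux, where /proc/self/status has a VmRSS line.
sub current_rss_kb {
    open my $fh, '<', '/proc/self/status' or return undef;
    while ( my $line = <$fh> ) {
        return $1 if $line =~ /^VmRSS:\s+(\d+)\s+kB/;
    }
    return undef;
}

my $before  = current_rss_kb();
my @ballast = map { 'x' x 1024 } 1 .. 10_000;    # stand-in for real work
my $after   = current_rss_kb();

if ( defined $before && defined $after ) {
    printf "RSS changed by %d kB\n", $after - $before;
}
```

Calling such a helper every few hundred requests would show whether growth follows the number of requests made.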

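As requested, I have provided code that runs on its own. I start from chromosome 21 because it has the fewest pages and therefore takes the least time to read through. Memory usage still increases with this snippet, so hopefully it is enough. Also, the eval statements are there because querying the Wikipedia API sometimes returns nothing instead of the expected JSON. eval lets me work around this without the program dying.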
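The eval guard described above can also be given a retry limit, so that a run of bad responses cannot loop forever. This is a minimal sketch rather than the program's actual code: `try_decode_json` and `fetch_json_with_retry` are hypothetical helper names, JSON::PP stands in for Cpanel::JSON::XS so that only core modules are needed, and the canned response bodies simulate the API.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Decode a JSON body, returning undef instead of dying on bad input.
sub try_decode_json {
    my ($body) = @_;
    return undef unless defined $body && length $body;
    my $data = eval { decode_json($body) };
    return $data;    # undef if decode_json died
}

# Call $fetch (a code ref returning a response body) at most $max_tries
# times, returning the first body that decodes, or undef if none does.
sub fetch_json_with_retry {
    my ( $fetch, $max_tries ) = @_;
    for my $attempt ( 1 .. $max_tries ) {
        my $data = try_decode_json( $fetch->() );
        return $data if defined $data;
        warn "Attempt $attempt: response was not valid JSON\n";
    }
    return undef;
}

# Simulated responses: empty, truncated, then valid (no network needed).
my @bodies = ( '', '{"query":', '{"query":{"categorymembers":[]}}' );
my $data   = fetch_json_with_retry( sub { shift @bodies }, 5 );
print defined $data ? "Decoded after retries\n" : "Gave up\n";
```

In a real run the code ref would wrap the HTTP request, and a short sleep between attempts would be kinder to the API.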

My (updated) code

#!/usr/bin/env perl -w

use common::sense;
use WWW::Mechanize;
use URI;
use HTTP::Request;
use Cpanel::JSON::XS qw(decode_json);

my ( $self, $registry ) = @_;

my $mech = WWW::Mechanize->new();

my $root = URI->new("http://en.wikipedia.org/w/api.php");

my $url = $root->clone();

for my $i ( 21 .. 25 ) {
    my $chrom = $i;
    if ( $chrom == 23 ) {
        $chrom = "M";
    }
    elsif ( $chrom == 24 ) {
        $chrom = "Y";
    }
    elsif ( $chrom == 25 ) {
        $chrom = "X";
    }
    print "Hi!\n The chromosome is $chrom\n";

    my $query = {
        action     => 'query',
        format     => 'json',
        list       => 'categorymembers',
        cmtitle    => "Category:Genes on human chromosome $chrom",
        cmlimit    => 'max',
        cmcontinue => ''
    };

    $url->query_form($query);

    my @gene_pages = ();
    eval {
        while ( my $response = $mech->get($url) ) {
            my $perl_scalar = decode_json( $response->decoded_content() )
                ;    #J Source of malformed JSON string error
            push @gene_pages, @{ $perl_scalar->{query}->{categorymembers} };
            my $count = @gene_pages;

            # Adapted code to new format for continuing queries

            if ( $perl_scalar->{continue} ) {
                $query->{cmcontinue} = $perl_scalar->{continue}->{cmcontinue};
                $url->query_form($query);
            }
            else {
                last;
            }
        }
    };
    if ( $@ =~ /malformed/ ) {
        redo;
    }
    my $gene_count = 0;
    eval {
        foreach my $gene_page (@gene_pages) {
            $gene_count++;
            my $url   = $root->clone();
            my $query = {
                action  => 'query',
                prop    => 'revisions',
                format  => 'json',
                rvprop  => 'content|tags|timestamp',
                pageids => $gene_page->{pageid}
            };
            $url->query_form($query);

            #       $log->debugf("Requesting: %s", $url->as_string());
            my $response    = $mech->get($url);
            my $content     = $response->decoded_content();
            my $perl_scalar = decode_json( $response->decoded_content() )
                ;    #J Source of malformed JSON string error
            if ( $gene_count % 10 == 0 ) {
                print "$gene_count gene pages complete\n";
            }
        }
    };
    print "There were $gene_count genes found for chromosome $chrom\n";

}

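This is part of a much larger program, but I have left the rest out because I know this is the area where the problem originates.

The part of the while loop that uses WWW::Mechanize,

my $response = $mech->get($url)

is what is tied to the memory leak.

If I remove that part and run the program, memory usage stays roughly constant; adding it back makes memory usage climb again.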

Perl version: 5.24.1

System: Ubuntu 16.04

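EDIT: @Borodin Thank you for such a thorough reply! Unfortunately, I still see a memory leak on my computer, which makes me wonder whether there is a larger problem.

It still gradually eats up memory. For now my computer copes, but I don't know whether it will have enough when I run the full program, which includes some web scraping.

On a possibly related note: my computer has an odd problem where it sometimes fails to download files completely (the download is reported as finished, but the file is still left unsorted). When I run your program, I often hit this error:

** unexpected end of string while parsing JSON string, at character offset 5506 (before "(end of string)") **

This seems related to the problem I am having, and I wonder whether it could be causing the memory leak?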

Answer 1

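You aren't using any part of WWW::Mechanize that isn't provided by LWP::UserAgent, so I suggest you fall back to the latter.

Here is some working code that does almost the same thing as your own program. It runs for me without any memory leak.

Please ask if you need anything explained; there is too much in the program to go over all of it.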

#!/usr/bin/env perl

use strict;
use warnings 'all';

use URI;
use URI::QueryParam;
use LWP;
use JSON::XS qw(decode_json);

STDOUT->autoflush;

my $api_root = URI->new( 'http://en.wikipedia.org/w/api.php' );

my @chromosomes = ( 1 .. 22, qw/ M Y X/ );

my $ua = LWP::UserAgent->new;

for my $chrom ( @chromosomes[20..$#chromosomes] ) {

    #print "The chromosome is $chrom\n";

    my $query = {
        action  => 'query',
        format  => 'json',
        list    => 'categorymembers',
        cmtitle => "Category:Genes on human chromosome $chrom",
        cmlimit => 'max',
    };

    my $url = $api_root->clone;
    $url->query_form( $query );

    my @gene_pages;

    while () {

        my $resp = $ua->get( $url );
        die $resp->status_line unless $resp->is_success;

        # J Source of malformed JSON string error
        my $data     = decode_json( $resp->decoded_content );
        my $query    = $data->{query};
        my $continue = $data->{continue};

        push @gene_pages, @{ $query->{categorymembers} };

        # Adapted code to new format for continuing queries
        last unless $continue;

        $url->query_param( cmcontinue => $continue->{cmcontinue} );
    }

    printf "Processing %d gene pages for chromosome %s\n",
            scalar @gene_pages,
            $chrom;

    my $gene_count = 0;    # start at 0 so the summary line is safe for an empty category

    for my $gene_page ( @gene_pages ) {

        ++$gene_count;

        my $url = $api_root->clone;

        my $query = {
            action  => 'query',
            prop    => 'revisions',
            format  => 'json',
            rvprop  => 'content|tags|timestamp',
            pageids => $gene_page->{pageid}
        };

        $url->query_form( $query );

        # print "Requesting: $url\n";

        my $resp = $ua->get( $url );

        die $resp->status_line unless $resp->is_success;

        my $content = $resp->decoded_content;
        my $data    = decode_json( $content );    # J Source of malformed JSON string error

        print "$gene_count gene pages complete\n" unless $gene_count % 10;
    }

    print "There were $gene_count genes found for chromosome $chrom\n";
}
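The program above sidesteps WWW::Mechanize entirely. If staying with WWW::Mechanize is preferred, one thing that may be worth checking is its page history: by default it keeps a stack of previously fetched pages, and the `stack_depth` constructor option limits that. Whether this history is what causes the growth seen in the question is an assumption, not something confirmed in this thread. A minimal sketch:

```perl
use strict;
use warnings;
use WWW::Mechanize;

# stack_depth => 0 asks Mechanize to keep no page history, so earlier
# responses are not retained between requests.
# Assumption: retained history is what makes memory grow in this case.
my $mech = WWW::Mechanize->new( stack_depth => 0 );

printf "History depth: %d\n", $mech->stack_depth;
```

With this setting the `$mech->get($url)` calls stay unchanged; only the `back` method stops being useful, and the original program does not call it.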
