How do I extract quoted portions from a text in Perl?

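For example, from a text like this:

By 1984, Dylan was distancing himself from the "born again" label. He told Kurt Loder of Rolling Stone magazine: "I've never said I'm born again. That's just a media term. I don't think I've been an agnostic. I've always thought there's a superior power, that this is not the real world and that there's a world to come."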

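I want to extract the quoted parts, that is, "born again" and the full statement Dylan made to Loder.

The text obviously has no fixed number of quotes, so the solution needs to extract all quoted portions.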

I tried Text::Balanced like this:

extract_delimited($text, "\""); 

in a loop, but I couldn't even extract "born again", which would have been a good start.

Is Text::Balanced the right tool? Where am I going wrong?

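If you don't need to handle things like quotes within quotes, Text::Balanced is probably overkill.

Assume that a " character at the start of the string, or preceded by whitespace, opens a quote, and that the next " at the end of the string, or followed by a non-word character, closes it. Then /(?:\s|\A)(\".+?\")(?:\W|\z)/sm should capture the quoted strings, including the quotation marks.

Add the /g modifier to capture all the quotes, and you get: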

use strict;
use warnings;
use Data::Dumper;

my $data = <<'DATA';
By 1984, Dylan was distancing himself from the "born again" label. He told
Kurt Loder of Rolling Stone magazine: "I've never said I'm born again.
That's just a media term. I don't think I've been an agnostic. I've always
thought there's a superior power, that this is not the real world and that
there's a world to come."
DATA

my @quoted_parts = ( $data =~ /(?:\s|\A)(\".+?\")(?:\W|\z)/gsm );

print Dumper \@quoted_parts;
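
One caveat with the pattern above: the whitespace before the opening quote and the non-word character after the closing quote are consumed by each match, so a quoted string that directly follows another one (separated by a single space) can be skipped. A minimal sketch of a variant that expresses the same two boundary conditions as zero-width lookaround assertions is shown below. The sample sentence is made up for illustration and is not from the question.

```perl
use strict;
use warnings;

# Hypothetical sample: two quoted strings separated by a single space.
my $text = 'She said "yes" "no" and then "maybe".';

# Pattern from the answer: the boundary characters are part of each match,
# so the space after "yes" is no longer available to introduce "no".
my @consuming = $text =~ /(?:\s|\A)(".+?")(?:\W|\z)/gs;

# Variant (an assumption, not the answer's pattern):
#   (?<!\S)  the quote is not preceded by a non-space character
#   (?!\w)   the closing quote is not followed by a word character
# Both assertions are zero-width, so neighbouring quotes do not interfere.
my @lookaround = $text =~ /(?<!\S)(".+?")(?!\w)/gs;

print "consuming:  @consuming\n";
print "lookaround: @lookaround\n";
```

For the Dylan text both patterns should give the same two strings, since its quotes are not adjacent; the difference only shows up on input like the sample here.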

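Text::Balanced is useful when you need to handle different kinds of brackets that may be nested, such as "( [ ( ) ] )", and you need to make sure each closing bracket is matched with the correct opening bracket. It is also useful when you want quotes to be able to contain escaped quote characters, and things like that. It is really meant for processing formal languages such as XML, JSON, programming languages, and configuration files, not for parsing natural language.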
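
To illustrate the two cases just mentioned, here is a small sketch using extract_bracketed and extract_delimited in list context, where they return the extracted string followed by the remainder of the input. The input strings are made-up examples, and the escape handling relies on the module's default escape character (backslash).

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed extract_delimited);

# Nested brackets: the second argument lists the bracket types to balance.
my $nested = '( [ ( ) ] ) tail';
my ($brackets, $rest) = extract_bracketed($nested, '()[]');
print "brackets: $brackets\n";
print "rest:     $rest\n";

# A quoted string that contains backslash-escaped quote characters.
my $escaped = '"say \"hi\" now" and more';
my ($quoted, $after) = extract_delimited($escaped, '"');
print "quoted:   $quoted\n";
print "after:    $after\n";
```

Both inputs start with the delimiter, which is what these functions expect; a regex with a lazy .+? would stop at the first escaped quote in the second example.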

Since you tried Text::Balanced without success, perhaps this is what you want:

#!/usr/bin/env perl

use Data::Dumper;
use Params::Validate qw(:all);
use Text::Balanced qw(extract_delimited extract_multiple);
use 5.01800;
use warnings;

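sub dump_stringsQuoted {    # Dumps quoted strings
    my ($text_S) = validate_pos(@_, { type => SCALAR });
    warn Data::Dumper->Dump([\$text_S], [qw(*text)]), ' ';

    # A true 4th argument makes extract_multiple skip the unquoted text,
    # so only the fields found by extract_delimited are returned.
    for (extract_multiple($text_S, [ sub { extract_delimited($_[0], q{"}) } ], undef, 1)) {
        say $_;
    }
}    # dump_stringsQuoted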

local $/;
dump_stringsQuoted(<DATA>);
__DATA__
By 1984, Dylan was distancing himself from the "born again" label. He told Kurt
Loder of Rolling Stone magazine: "I've never said I'm born again. That's just a
media term. I don't think I've been an agnostic. I've always thought there's a
superior power, that this is not the real world and that there's a world to come."

produces

duh >perl TB.pl
$text = \'By 1984, Dylan was distancing himself from the "born again" label. He told Kurt
Loder of Rolling Stone magazine: "I\'ve never said I\'m born again. That\'s just a
media term. I don\'t think I\'ve been an agnostic. I\'ve always thought there\'s a
superior power, that this is not the real world and that there\'s a world to come."';
  at TB.pl line 11, <DATA> chunk 1.
"born again"
"I've never said I'm born again. That's just a
media term. I don't think I've been an agnostic. I've always thought there's a
superior power, that this is not the real world and that there's a world to come."
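
As for why the bare extract_delimited loop in the question found nothing: extract_delimited only looks for the delimited substring at the current start of the text, after an optional prefix that defaults to whitespace. The sample text starts with "By 1984", so the very first call fails. A minimal sketch of a hand-rolled loop, assuming a prefix pattern of [^"]* to skip everything up to the next quote (the shortened text and the variable names are illustrative):

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);

# Shortened version of the sample text from the question.
my $text = q{By 1984, Dylan was distancing himself from the "born again" label. He told Kurt Loder of Rolling Stone magazine: "I've never said I'm born again."};

# Default prefix (\s*): fails, because the text does not start with a quote.
my ($none) = extract_delimited($text, '"');
print "without prefix: ", (defined $none ? $none : 'nothing found'), "\n";

# With an explicit prefix that skips any run of non-quote characters.
my @quotes;
my $remainder = $text;
while (1) {
    # List context: returns the extracted string and the remaining text.
    my ($found, $rest) = extract_delimited($remainder, '"', '[^"]*');
    last unless defined $found;    # no further quoted string
    push @quotes, $found;
    $remainder = $rest;
}
print "$_\n" for @quotes;
```

This does by hand roughly what extract_multiple with the skip flag does in the script above, so the extract_multiple version is the shorter choice in practice.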