Python re.findall() returning empty list

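I'm trying to match some words with a regular expression and wrote some Python code for it. Strangely, re.findall() returns an empty list when matching. However, the pattern and the text file show matches on regxr.com. Here is the code: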

import re

pat1 = '(\S+)_(?:JJ)_\S+\b(?:\s+)(\S+)_(?:NN|NNS)_\S+\b'
pat2 = '(\S+?)_(?:RR|RBR|RBS)_\S+\b(?:\s+)(\S+?)_(?:JJ)_\S+\b(?:\s+)(?!\S*?_(?:NN|NNS)_\S+\b)'
pat3 = '(\S+?)_(?:JJ)_\S+\b(?:\s+)(\S+?)_(?:JJ)_\S+\b(?:\s+)(?!\S*?_(?:NN|NNS)_\S+\b)'
pat4 = '(\S+?)_(?:NN|NNS)_\S+\b(?:\s+)(\S+?)_(?:JJ)_\S+\b(?:\s+)(?!\S*?_(?:NN|NNS)_\S+\b)'
pat5 = '(\S+?)_(?:RB|RBR|RBS)_\S+\b(?:\s+)(\S+?)_(?:VB|VBD|VBN|VBG)_\S+\b(?:\s+)\S*?_\S+?_\S+\b'

def process_file(content):
    res = []
    for line in content:
        matches = re.findall(pat1,line)
        for m in matches:
            m = (m[0],m[1])
            phrase = '%s %s' % m
            res.append(phrase)
        matches = re.findall(pat2,line)
        for m in matches:
            m = (m[0],m[1])
            phrase = '%s %s' % m
            res.append(phrase)
        matches = re.findall(pat3,line)
        for m in matches:
            m = (m[0],m[1])
            phrase = '%s %s' % m
            res.append(phrase)
        matches = re.findall(pat4,line)
        for m in matches:
            m = (m[0],m[1])
            phrase = '%s %s' % m
            res.append(phrase)
        matches = re.findall(pat5,line)
        for m in matches:
            m = (m[0],m[1])
            phrase = '%s %s' % m
            res.append(phrase)
    return res

def main(path):
   contents = []
   f = open(path)
   for line in f:
      contents.append(line)
   f.close()
   result = process_file(contents) 
   print result

Here is the text file I'm using:

sydney_NN_B-NP lumet_NN_I-NP is_VBZ_B-VP the_DT_B-NP director_NN_I-NP whose_WP$_B-NP work_NN_I-NP happens_VBZ_B-VP to_TO_I-VP be_VB_I-VP of_IN_B-PP varied_VBN_B-NP quality_NN_I-NP ._._B-O he_PRP_B-NP is_VBZ_B-VP praised_VBN_I-VP for_IN_B-PP some_DT_B-NP of_IN_B-PP the_DT_B-NP most_RBS_I-NP important_JJ_I-NP films_NNS_I-NP of_IN_B-PP the_DT_B-NP previous_JJ_I-NP decades_NNS_I-NP ,_,_B-O like_IN_B-PP twelve_CD_B-NP angry_JJ_I-NP men_NNS_I-NP ,_,_B-O serpico_NN_B-NP or_CC_B-O the_DT_B-NP verdict_NN_I-NP ._._B-O but_CC_B-O ,_,_I-O in_IN_B-PP the_DT_B-NP same_JJ_I-NP time_NN_I-NP ,_,_B-O almost_RB_B-NP any_DT_I-NP of_IN_B-PP such_JJ_B-NP pearls_NNS_I-NP is_VBZ_B-VP followed_VBN_I-VP by_IN_B-PP stinkers_NNS_B-NP that_WDT_B-NP hamper_VBP_B-VP lumet's_JJ_B-NP reputation_NN_I-NP ._._B-O a_DT_B-NP stranger_NN_I-NP among_IN_B-PP us_PRP_B-NP ,_,_B-O 1992_CD_B-NP rip-off_NN_I-NP of_IN_B-PP peter_NN_B-NP weir's_JJ_I-NP witness_NN_I-NP ,_,_B-O belongs_VBZ_B-VP to_TO_B-PP the_DT_B-NP latter_NN_I-NP category_NN_I-NP ._._B-O the_DT_B-NP heroine_NN_I-NP of_IN_B-PP this_DT_B-NP movie_NN_I-NP is_VBZ_B-VP emily_JJ_B-NP eden_FW_I-NP (_(_B-O melanie_JJ_B-NP griffith_NN_I-NP )_)B-O ,,_I-O tough_JJ_B-NP lady_NN_I-NP cop_NN_I-NP who_WP_B-NP sometimes_RB_B-ADVP shows_VBZ_B-VP too_RB_B-NP much_JJ_I-NP enthusiasm_NN_I-NP in_IN_B-PP battling_VBG_B-VP bad_JJ_B-NP guys_NNS_I-NP on_IN_B-PP the_DT_B-NP streets_NNS_I-NP of_IN_B-PP new_JJ_B-NP york_NN_I-NP ._._B-O during_IN_B-PP one_CD_B-NP of_IN_B-PP such_JJ_B-NP actions_NNS_I-NP ,_,_B-O her_PRP$_B-NP partner_NN_I-NP nick_NN_I-NP (_(_B-O jamey_JJ_B-NP sheridan_NNS_I-NP )_)_B-O got_VBD_B-VP hurt_VBN_I-VP and_CC_B-O as_IN_B-PP a_DT_B-NP result_NN_I-NP ,_,_B-O she_PRP_B-NP becomes_VBZ_B-VP depressed_JJ_B-ADJP ._._B-O in_IN_B-PP order_NN_B-NP to_TO_B-VP help_VB_I-VP her_PRP_B-NP recover_VB_B-VP ,_,_B-O bosses_NNS_B-NP give_VBP_B-VP her_PRP_B-NP rather_RB_I-NP easy_JJ_I-NP task_NN_I-NP of_IN_B-PP locating_VBG_B-VP missing_VBG_B-NP jeweller_NNS_I-NP 
who_WP_B-NP belonged_VBD_B-VP to_TO_B-PP hassidic_JJ_B-NP jew_NN_I-NP community_NN_I-NP ._._B-O emily_NN_B-NP starts_VBZ_B-VP investigation_NN_B-NP and_CC_B-O soon_RB_B-VP realises_VBZ_I-VP that_IN_B-SBAR the_DT_B-NP case_NN_I-NP involves_VBZ_B-VP murder_NN_B-NP ._._B-O concluding_VBG_B-VP that_IN_B-SBAR the_DT_B-NP perpetrator_NN_I-NP belongs_VBZ_B-VP to_TO_B-PP community_NN_B-NP ,_,_B-O she_PRP_B-NP decides_VBZ_B-VP to_TO_I-VP go_VB_I-VP undercover_JJ_B-ADJP ._._B-O that_DT_B-NP isn't_RB_B-O easy_JJ_B-ADJP ,_,_B-O because_IN_B-SBAR her_PRP$_B-NP modern_JJ_I-NP manners_NNS_I-NP are_VBP_B-VP colliding_VBG_I-VP with_IN_B-PP traditionalist_NN_B-NP ways_NNS_I-NP ._._B-O things_NNS_B-NP get_VBP_B-VP even_RB_B-NP more_RBR_B-ADJP complicated_JJ_I-ADJP when_WRB_B-ADVP she_PRP_B-NP develops_VBZ_B-VP feelings_NNS_B-NP for_IN_B-PP young_JJ_B-NP cabalistic_JJ_I-NP scholar_NN_I-NP ariel_NN_I-NP (_(_B-O eric_JJ_B-NP thal_NN_I-NP )_)B-O .._I-O using_VBG_B-VP peter_NN_B-NP weir's_JJ_I-NP formula_NN_I-NP isn't_:_B-O the_DT_B-NP greatest_JJS_I-NP flaw_NN_I-NP of_IN_B-PP this_DT_B-NP film_NN_I-NP ._._B-O even_RB_B-NP the_DT_I-NP lame_JJ_I-NP and_CC_I-NP unispiring_JJ_I-NP crime_NN_I-NP mystery_NN_I-NP subplot_NN_I-NP works_VBZ_B-VP to_TO_B-PP the_DT_B-NP certain_JJ_I-NP extent_NN_I-NP ._._B-O but_CC_B-O the_DT_B-NP worst_JJS_I-NP insult_NN_I-NP to_TO_B-PP viewer's_JJ_B-NP audience_NN_I-NP is_VBZ_B-VP terrible_JJ_B-NP miscasting_NN_I-NP of_IN_B-PP melanie_JJ_B-NP griffith_NN_I-NP ._._B-O the_DT_B-NP author_NN_I-NP of_IN_B-PP this_DT_B-NP review_NN_I-NP never_RB_B-ADVP liked_VBD_B-VP this_DT_B-NP actress_NN_I-NP very_RB_B-ADVP much_RB_I-ADVP ,_,_B-O but_CC_I-O she_PRP_B-NP was_VBD_B-VP at_IN_B-ADVP least_JJS_I-ADVP tolerable_JJ_B-ADJP in_IN_B-PP some_DT_B-NP of_IN_B-PP her_PRP$_B-NP roles_NNS_I-NP ._._B-O role_NN_B-NP of_IN_B-PP emily_JJ_B-NP eden_NNS_I-NP ,_,_B-O unfortunately_RB_B-ADVP ,_,_B-O isn't_VBZ_I-O one_CD_B-NP of_IN_B-PP them_PRP_B-NP ._._B-O first_RB_B-ADVP of_IN_B-PP 
all_DT_B-NP ,_,_B-O she_PRP_B-NP can't_MD_B-VP pass_VB_I-VP for_IN_B-PP tough_JJ_B-NP nypd_JJ_I-NP street_NN_I-NP fighter_NN_I-NP ,_,_B-O and_CC_I-O her_PRP$_B-NP attempt_NN_I-NP to_TO_B-VP pass_VB_I-VP for_IN_B-PP orthodox_JJ_B-NP jewish_JJ_I-NP woman_NN_I-NP isn't_RB_B-O much_RB_B-ADJP better_JJR_I-ADJP ._._B-O screenplay_NN_B-NP by_IN_B-PP robert_JJ_B-NP j_NN_I-NP ._._B-O avrech_NNS_B-NP makes_VBZ_B-VP things_NNS_B-NP even_RB_B-ADJP worse_JJR_I-ADJP with_IN_B-PP some_DT_B-NP formulaic_JJ_I-NP red_JJ_I-NP herring_NN_I-NP subplots_NNS_I-NP (_(_B-O scene_NN_B-NP involving_VBG_B-VP two_CD_B-NP italian_JJ_I-NP gangsters_NNS_I-NP was_VBD_B-VP almost_RB_B-ADJP too_RB_I-ADJP painful_JJ_I-ADJP to_TO_B-VP watch_VB_I-VP )_)B-O .._I-O but_CC_B-O ,_,_I-O on_IN_B-PP the_DT_B-NP other_JJ_I-NP hand_NN_I-NP ,_,_B-O other_JJ_B-NP actors_NNS_I-NP are_VBP_B-VP more_RBR_B-ADJP convincing_JJ_I-ADJP (_(_B-O lee_NN_B-NP richardson_NN_I-NP as_IN_B-PP an_DT_B-NP old_JJ_I-NP rabbi_NN_I-NP ,_,_B-O thal_JJ_B-ADJP as_IN_B-PP ariel_NN_B-NP and_CC_B-O charming_JJ_B-NP mia_NN_I-NP sara_NN_I-NP as_IN_B-PP his_PRP$_B-NP intended_VBN_I-NP bride_NN_I-NP )_)B-O ,,_I-O and_CC_I-O the_DT_B-NP photography_NN_I-NP by_IN_B-PP andrzej_JJ_B-NP bartkowiak_NN_I-NP very_RB_B-ADVP effectively_RB_I-ADVP creates_VBZ_B-VP atmosphere_NN_B-NP of_IN_B-PP warmth_NN_B-NP when_WRB_B-ADVP the_DT_B-NP scenes_NNS_I-NP take_VBP_B-VP place_NN_B-NP in_IN_B-PP hassidic_JJ_B-NP community_NN_I-NP ._._B-O also_RB_B-ADVP ,_,_B-O the_DT_B-NP film_NN_I-NP might_MD_B-VP educate_VB_I-VP viewers_NNS_B-NP about_IN_B-PP hassidic_JJ_B-NP culture_NN_I-NP ._._B-O that_DT_B-NP is_VBZ_B-VP the_DT_B-NP only_JJ_I-NP thing_NN_I-NP that_WDT_B-NP prevents_VBZ_B-VP it_PRP_B-NP from_IN_B-PP turning_VBG_B-VP into_IN_B-PP total_JJ_B-NP waste_NN_I-NP of_IN_B-PP time_NN_B-NP ._._B-O

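You've been bitten by backslashes! The backslash is used as an escape character in Python string literals (as in many other languages). For example, \n means "newline", \r means "carriage return", and \b means "backspace", also known as \x08.

And all of your expressions have \b in them.

So when you write: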

>>> pat1 = '...\b...'

You get:

>>> pat1
'...\x08...'
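
To see the effect on matching, here is a minimal sketch; the pattern and sample strings are made up for illustration and are not taken from the question. It shows that the regex engine receives a literal backspace character instead of a word-boundary assertion:

```python
import re

# In a normal (non-raw) string literal, \b is one backspace character.
pattern = 'cat\b'
print(len(pattern))            # 4 characters: c, a, t, backspace
print(pattern == 'cat\x08')    # True

# Ordinary text contains no backspace, so nothing matches...
no_match = re.findall(pattern, 'a cat sat')

# ...but text that really contains a backspace character does match.
odd_match = re.findall(pattern, 'a cat\x08 sat')

print(no_match)
print(odd_match)
```

This is why the patterns silently return empty lists rather than raising an error: they are valid regular expressions, they just look for a character that never appears in the file.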

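There are two ways to fix this. You can escape each backslash with another backslash, like this:

>>> pat1 = '...\\b...'
>>> pat1
'...\\b...'

Note that you see \\ there because that is Python's representation of the string; if we print pat1, we get: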

>>> print pat1
...\b...

A simpler fix is to mark your regular expression strings as "raw strings":

The backslash (\) character is used to escape characters that otherwise have a special meaning, such as newline, backslash itself, or the quote character. String literals may optionally be prefixed with a letter 'r' or 'R'; such strings are called raw strings and use different rules for backslash escape sequences.

In other words:

pat1 = r'(\S+)_(?:JJ)_\S+\b(?:\s+)(\S+)_(?:NN|NNS)_\S+\b'
pat2 = r'(\S+?)_(?:RR|RBR|RBS)_\S+\b(?:\s+)(\S+?)_(?:JJ)_\S+\b(?:\s+)(?!\S*?_(?:NN|NNS)_\S+\b)'
pat3 = r'(\S+?)_(?:JJ)_\S+\b(?:\s+)(\S+?)_(?:JJ)_\S+\b(?:\s+)(?!\S*?_(?:NN|NNS)_\S+\b)'
pat4 = r'(\S+?)_(?:NN|NNS)_\S+\b(?:\s+)(\S+?)_(?:JJ)_\S+\b(?:\s+)(?!\S*?_(?:NN|NNS)_\S+\b)'
pat5 = r'(\S+?)_(?:RB|RBR|RBS)_\S+\b(?:\s+)(\S+?)_(?:VB|VBD|VBN|VBG)_\S+\b(?:\s+)\S*?_\S+?_\S+\b'
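
As a quick check that the two fixes are equivalent, the following sketch compares a raw string with its doubled-backslash form. The short pattern and sample text are made up for illustration. It also hints at why only \b caused trouble: \S and \s are not escape sequences Python recognizes, so a normal string leaves them untouched (recent Python 3 releases emit a warning for such unrecognized escapes).

```python
import re

raw = r'\S+\b'        # raw string: backslashes are kept exactly as written
escaped = '\\S+\\b'   # normal string with every backslash doubled

same = (raw == escaped)
length = len(raw)     # 5 characters: backslash, S, +, backslash, b

# With a real \b reaching the regex engine, the word boundary works.
words = re.findall(raw, 'easy_JJ task_NN')

print(same, length, words)
```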

With that change, I get matches using your sample data:

>>> re.findall(pat1, data)
[('important', 'films'), ('previous', 'decades'), ('angry', 'men'), ('same', 'time'), ('such', 'pearls'), ("lumet's", 'reputation'), ("weir's", 'witness'), ('melanie', 'griffith'), ('tough', 'lady'), ('much', 'enthusiasm'), ('bad', 'guys'), ('new', 'york'), ('such', 'actions'), ('jamey', 'sheridan'), ('easy', 'task'), ('hassidic', 'jew'), ('modern', 'manners'), ('cabalistic', 'scholar'), ('eric', 'thal'), ("weir's", 'formula'), ('unispiring', 'crime'), ('certain', 'extent'), ("viewer's", 'audience'), ('terrible', 'miscasting'), ('melanie', 'griffith'), ('emily', 'eden')]
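
For completeness, here is one way the per-line processing from the question could be condensed once the patterns are raw strings. This is only a sketch: the names PATTERNS, extract_phrases and sample are hypothetical, the two sample lines are short excerpts cut from the posted text, and the patterns are copied unchanged from above (including the RR tag in the second one).

```python
import re

# The five patterns as raw strings, compiled once up front.
PATTERNS = [re.compile(p) for p in (
    r'(\S+)_(?:JJ)_\S+\b(?:\s+)(\S+)_(?:NN|NNS)_\S+\b',
    r'(\S+?)_(?:RR|RBR|RBS)_\S+\b(?:\s+)(\S+?)_(?:JJ)_\S+\b(?:\s+)(?!\S*?_(?:NN|NNS)_\S+\b)',
    r'(\S+?)_(?:JJ)_\S+\b(?:\s+)(\S+?)_(?:JJ)_\S+\b(?:\s+)(?!\S*?_(?:NN|NNS)_\S+\b)',
    r'(\S+?)_(?:NN|NNS)_\S+\b(?:\s+)(\S+?)_(?:JJ)_\S+\b(?:\s+)(?!\S*?_(?:NN|NNS)_\S+\b)',
    r'(\S+?)_(?:RB|RBR|RBS)_\S+\b(?:\s+)(\S+?)_(?:VB|VBD|VBN|VBG)_\S+\b(?:\s+)\S*?_\S+?_\S+\b',
)]


def extract_phrases(lines, patterns=PATTERNS):
    """Collect 'word1 word2' phrases for every pattern match in every line."""
    phrases = []
    for line in lines:
        for pattern in patterns:
            # Each pattern has exactly two capturing groups, so findall
            # yields (word1, word2) tuples.
            for first, second in pattern.findall(line):
                phrases.append('%s %s' % (first, second))
    return phrases


# Two short excerpts from the posted text file.
sample = [
    'most_RBS_I-NP important_JJ_I-NP films_NNS_I-NP',
    'never_RB_B-ADVP liked_VBD_B-VP this_DT_B-NP',
]
print(extract_phrases(sample))
```

Looping over a list of compiled patterns keeps the result order the same as in the original function (all patterns for one line, then the next line) while removing the five copies of the same loop body.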