Parsing a Buffy script with regex lookbehinds

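I'm having trouble parsing this page: http://www.buffyworld.com/buffy/transcripts/114_tran.html

I'm trying to extract each character name together with the dialogue that belongs to it. The text looks like this: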

<p>BUFFY: Wait!
<p>She stands there panting, watching the truck turn a corner.
<p>BUFFY: (whining) Don't you want your garbage?
<p>She sighs, pouts, turns and walks back toward the house.
<p>Cut to the kitchen. Buffy enters through the back door, holding a pile of
mail. She begins looking through it. We see Dawn standing by the island.
<p>DAWN: Hey Buffy. Oh, don't forget, today's trash day.<br>BUFFY: (sourly)
Thanks.
<p>Dawn piles her books into her school bag. Buffy opens a letter.
<p>Close shot of the letter.
<p>
<p>Dawn smiles, and she and Willow exit. Buffy picks up the still-wrapped
sandwich and stares at it.
<p>BUFFY: (to herself) Somebody should.
<p>She sighs, puts the sandwich back in the bag.
<p>Cut to the Bronze. Pan across various people drinking and dancing,
bartender serving. Reveal Xander and Anya sitting at the bar eating chips from
several bags. A notebook sits in front of them bearing the wedding seating
chart.
<p>ANYA: See ... this seating chart makes no sense. We have to do it again.
(Xander nodding) We can't do it again. You do it.<br>XANDER: The seating
chart's fine. Let's get back to the table arrangements. I'm starting to have
dreams of gardenia bouquets. (winces) I am so glad my manly coworkers didn't
just hear me say that. (eating chips)

Ideally, I would match from one <p> or <br> up to the next <p> or <br>. To do that, I tried using lookaheads and lookbehinds:

reg = "((?<=<p>)|(?<=<br>))(?P<character>.+):(?P<dialogue>.+)((?=<p>)|(?=<br>))"
script = re.findall(reg, html_text)

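Unfortunately, this doesn't match anything. When I leave off the lookahead ((?=<p>)|(?=<br>)), I do get matches, but only for dialogue that contains no line break. The match seems to stop at the newline instead of continuing to the next <p>.

For example, in this line "Thanks" is not matched: <p>DAWN: Hey Buffy. Oh, don't forget, today's trash day.<br>BUFFY: (sourly) Thanks.

Thanks for any insight!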

Avoid the dot altogether and match "anything that is not a <" instead:

re.findall('((?<=<p>)|(?<=<br>))([A-Z]+):([^<]+)', text)
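Here is a small runnable sketch of that pattern, to show that the negated class [^<] crosses line breaks where the dot does not. The `sample` string is just a few lines copied from the question, and I switched the lookbehind group to a non-capturing one (my change, not part of the pattern above) so findall returns plain (name, dialogue) pairs.

```python
import re

# A few lines copied from the transcript in the question.
sample = (
    "<p>DAWN: Hey Buffy. Oh, don't forget, today's trash day.<br>BUFFY: (sourly)\n"
    "Thanks.\n"
    "<p>Dawn piles her books into her school bag. Buffy opens a letter.\n"
)

# [^<]+ matches any character except '<', newlines included,
# so a speech runs until the next tag begins.
pattern = r'(?:(?<=<p>)|(?<=<br>))([A-Z]+):([^<]+)'
matches = re.findall(pattern, sample)

for character, dialogue in matches:
    # Collapse the line breaks inside a speech into single spaces.
    print(character, '->', ' '.join(dialogue.split()))
```

Note that [A-Z]+ only accepts names made of capital letters, so a speaker with a space or punctuation in the name would be skipped.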

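You could also try the special flag (re.DOTALL) that makes the dot match newlines as well. Personally, I avoid regular expressions whenever I can use split or an HTML parser. RE escaping, with all its arguments, restrictions, and flags, can drive anyone mad. There is also re.split.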
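If you want to stay close to your original pattern, something like the following might work. It is only a sketch: the `sample` string is copied from the question, the non-greedy `.+?` and the `\Z` alternative in the lookahead are my additions, and I kept [A-Z]+ for the name. With re.DOTALL a greedy `.+` would run on to the last tag in the text, which is why the dialogue group is made lazy here.

```python
import re

# A few lines copied from the transcript in the question.
sample = (
    "<p>DAWN: Hey Buffy. Oh, don't forget, today's trash day.<br>BUFFY: (sourly)\n"
    "Thanks.\n"
    "<p>Dawn piles her books into her school bag. Buffy opens a letter.\n"
)

pattern = re.compile(
    r'(?:(?<=<p>)|(?<=<br>))'      # must start right after <p> or <br>
    r'(?P<character>[A-Z]+):'      # speaker name in capitals, then a colon
    r'(?P<dialogue>.+?)'           # lazy, so it stops at the first tag
    r'(?=<p>|<br>|\Z)',            # next tag, or the end of the text
    re.DOTALL,                     # let '.' match newlines too
)

results = [
    (m.group('character'), ' '.join(m.group('dialogue').split()))
    for m in pattern.finditer(sample)
]
print(results)
```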

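dialogs = {}
# Treat <br> like <p> so that every speech starts its own chunk.
text = html_text.replace('<br>', '<p>')
paragraphs = text.split('<p>')

for p in paragraphs:
    if ":" in p:
        # Split on the first colon only: speaker before it, speech after it.
        char, line = p.split(":", 1)
        # setdefault creates the list on first use, so the first line is kept too.
        dialogs.setdefault(char, []).append(line)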
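The same idea with re.split, which cuts on both tags in one step, could look roughly like this. Again just a sketch: `sample` is copied from the question, and the `isupper()` check is my own assumption for skipping stage directions that happen to contain a colon.

```python
import re
from collections import defaultdict

# A few lines copied from the transcript in the question.
sample = (
    "<p>DAWN: Hey Buffy. Oh, don't forget, today's trash day.<br>BUFFY: (sourly)\n"
    "Thanks.\n"
    "<p>Dawn piles her books into her school bag. Buffy opens a letter.\n"
)

dialogs = defaultdict(list)

# Split on either tag, so each chunk is one speech or one stage direction.
for chunk in re.split(r'<p>|<br>', sample):
    speaker, sep, speech = chunk.partition(':')
    speaker = speaker.strip()
    # Keep only chunks that contain a colon and have an all-caps speaker.
    if sep and speaker.isupper():
        dialogs[speaker].append(' '.join(speech.split()))

for speaker, lines in dialogs.items():
    print(speaker, lines)
```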