How to extract (speaker, text) tuples from earnings call transcripts with regex?

For my master's thesis, I need to extract (speaker, text) tuples from company earnings call transcripts.

The transcripts have the following form:

OPERATOR: Some text with numbers, special characters and linebreaks.

NAME, COMPANY, POSITION: Some text with numbers, special characters and linebreaks.

NAME: Some text with numbers, special characters and linebreaks.

I want to extract all (speaker, text) tuples from a document. For example:

[("OPERATOR", "Some text with numbers, special characters and linebreaks."), ..]

So far, I have tried different regexes with the re.findall function in Python.

Here is an example excerpt:

example = """OPERATOR: Good day, ladies and gentlemen, and welcome to the first-quarter 2012
Agilent Technologies earnings conference call. My name is Keith, and I will be
your operator for today. At this time, all participants are in a listen-only
mode. Later on, we will have a question and answer session. (Operator
Instructions) As a reminder, today's conference is being recorded for replay
purposes.

And I would now like to turn the conference over to your host for today, Ms.
Alicia Rodriguez, Vice President of Investor Relations. Please go ahead, ma'am.

ALICIA RODRIGUEZ, VP - IR, AGILENT TECHNOLOGIES INC: Thank you, Keith, and
welcome, everyone, to Agilent's first quarter conference call for fiscal-year
2012. With me are Agilent's President and CEO, Bill Sullivan, as well as Senior
Vice President and CFO, Didier Hirsch. Joining in the Q&A after Didier's
comments will be Agilent's Chief Operating Officer, Ron Nersesian, and the
Presidents of our Electronic Measurement, Life Sciences, and Chemical Analysis
Groups -- Guy Sene, Nick Roelofs, and Mike McMullen.

You can find the press release and information to supplement today's discussion
on our website at www.investor.agilent.com. While there, please click on the
link for financial results, where you will find revenue breakouts and historical
financials for Agilent's operations. We will also post a copy of the prepared
remarks following this call. For any non-GAAP financial measures, you will find
the most directly comparable GAAP financial metrics and reconciliations on our
website.

We will make forward-looking statements about the financial performance of the
Company. These statements are subject to risks and uncertainties, and are only
valid as of today. The Company assumes no obligation to update them. Please look
at the Company's recent SEC filings for a more complete picture of our risks and
other factors.

Before turning the call over to Bill, I would like to remind you that Agilent
will host its annual analysts meeting in New York City on March 8. Details about
the meeting and webcast will be available on the Agilent investor relations
website two weeks prior.

And now, I'd like to turn the call over to Bill.

BILL SULLIVAN, PRESIDENT AND CEO, AGILENT TECHNOLOGIES INC: Thanks, Alicia, and
hello, everyone. Agilent's Q1 orders of $1.62 billion were flat versus last
year. Q1 revenues of $1.64 billion were up 7% year-over-year. Non-GAAP EPS was
$0.69 per share, and operating margin was 19%."""

This is my code:

import re

# First approach:
r = re.compile(r"^([^a-z:]+?):([\s\S]+?)", flags=re.MULTILINE)
re.findall(r, example)

# Second approach:
r = re.compile(r"^([^a-z:]+?):([\s\S]+)", flags=re.MULTILINE)
re.findall(r, example)

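The problem with the first (non-greedy) approach is that it does not capture the speaker's full text.

The problem with the second (greedy) approach is that it does not stop when the next speaker appears.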
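To make the two failure modes concrete, here is a minimal check on a made-up two-speaker string. The sample string and the variable names are illustrative only and are not taken from the real transcripts:

```python
import re

# Hypothetical two-speaker sample, shortened for illustration.
sample = "OPERATOR: Hello\nall.\nNAME: Hi\nthere."

# Same two patterns as above.
lazy = re.compile(r"^([^a-z:]+?):([\s\S]+?)", flags=re.MULTILINE)
greedy = re.compile(r"^([^a-z:]+?):([\s\S]+)", flags=re.MULTILINE)

lazy_result = lazy.findall(sample)
greedy_result = greedy.findall(sample)

# The lazy group is satisfied by a single character after the colon.
print(lazy_result)    # [('OPERATOR', ' '), ('NAME', ' ')]

# The greedy group runs to the end of the string, swallowing the next speaker.
print(greedy_result)  # [('OPERATOR', ' Hello\nall.\nNAME: Hi\nthere.')]
```

So the lazy quantifier has nothing after it that forces it to expand, and the greedy one has nothing that tells it where to stop.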

Edit: additional information

You can match without using [\s\S]+, since that will match any character, including newlines.

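For the second capture group, you can match .* and then use a repeating group with a negative lookahead, (?:(?!\n[^a-z\r\n]+:)[\r\n].*)*, which keeps matching as long as the following line does not start with the speaker-name pattern [^a-z\r\n]+: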

^([^a-z\r\n]+):(.*(?:(?!\n[^a-z\r\n]+:)[\r\n].*)*)

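As a rough sketch of how the pattern above could be used from Python: the helper name extract_turns and the shortened sample transcript are assumptions made for illustration, and stripping whitespace is an optional extra step. The re.MULTILINE flag is needed so that ^ matches at the start of every line.

```python
import re

# Pattern from the answer above; re.MULTILINE lets ^ match at each line start.
SPEAKER_TURN = re.compile(
    r"^([^a-z\r\n]+):(.*(?:(?!\n[^a-z\r\n]+:)[\r\n].*)*)",
    flags=re.MULTILINE,
)


def extract_turns(transcript):
    """Return (speaker, text) tuples with surrounding whitespace stripped."""
    return [
        (speaker.strip(), text.strip())
        for speaker, text in SPEAKER_TURN.findall(transcript)
    ]


# Hypothetical shortened transcript, used only for illustration.
sample = (
    "OPERATOR: Good day, ladies and gentlemen.\n"
    "Please go ahead, ma'am.\n"
    "\n"
    "ALICIA RODRIGUEZ, VP - IR, AGILENT TECHNOLOGIES INC: Thank you, Keith.\n"
    "\n"
    "And now, I'd like to turn the call over to Bill.\n"
    "\n"
    "BILL SULLIVAN, PRESIDENT AND CEO, AGILENT TECHNOLOGIES INC: Thanks, Alicia."
)

turns = extract_turns(sample)
for speaker, text in turns:
    print(repr(speaker), "->", repr(text))
```

Without the strip calls, each text keeps the leading space after the colon and any trailing newline before the next speaker. Note that any body line consisting only of non-lowercase characters up to a colon (for example a line starting with "Q&A:") would also be treated as a new speaker, so the result is worth spot-checking on real transcripts.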