How can I capture all sentences in a file with the format (name): (sentence)\n(name):

Question

I have transcript files in the format

(name): (sentence)\n (<-- There can be multiples of this pattern)

(name): (sentence)\n
(sentence)\n

and so on. I need all of the sentences. So far I have gotten it to work by hard-coding the names that appear in the file, but I need it to be generic.

utterances = re.findall(r'(?:CALLER: |\nCALLER:\nCRO: |\nCALLER:\nOPERATOR: |\nCALLER:\nRECORDER: |RECORDER: |CRO: |OPERATOR: )(.*?)(?:CALLER: |RECORDER : |CRO: |OPERATOR: |\nCALLER:\n)', raw_calls, re.DOTALL)

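I am using re with Python 3.6. Alternatively, if anyone knows how to do this with spaCy, that would be a great help. Thanks.

For an empty statement, I would just want to grab the \n that follows it and put that in its own string. I guess I will simply have to accept capturing lines such as the tape information given at the end of this example, because I can't think of a way to tell whether such a line is part of someone's speech. Also, sometimes there is more than one word between the start of the line and the colon.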

Mock data:

CRO: How far are you from the World Trade Center, how many blocks, about? Three or four blocks?

63FDNY 911 Calls Transcript - EMS - Part 1 9-11-01

CALLER:

CRO: You're welcome. Thank you.

OPERATOR: Bye.

CRO: Bye.

RECORDER: The preceding portion of tape concludes at 0913 hours, 36 seconds.

This tape will continue on side B.

OPERATOR NEWELL: blah blah.

Answer 1

You never gave us mock data, so I used the following for testing:

name1: Here is a sentence.
name2: Here is another stuff: sentence
which happens to have two lines
name3: Blah.

We can try matching with the following pattern:

^\S+:\s+((?:(?!^\S+:).)+)

This can be explained as:

^\S+:\s+           match the name, followed by a colon, followed by one or more whitespace characters
((?:(?!^\S+:).)+)  then match and capture everything up until the next name at the start of a line

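Note that this also handles the edge case of the final sentence: after the last name, the ^\S+: inside the negative lookahead never matches again, so the lookahead keeps succeeding and all of the remaining content is captured.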

Code sample:

import re
line = "name1: Here is a sentence.\nname2: Here is another stuff: sentence\nwhich happens to have two lines\nname3: Blah."
matches = re.findall(r'^\S+:\s+((?:(?!^\S+:).)+)', line, flags=re.DOTALL|re.MULTILINE)
print(matches)

['Here is a sentence.\n', 'Here is another stuff: sentence\nwhich happens to have two lines\n', 'Blah.']
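The pattern above treats a name as a single run of non-whitespace characters, so a multi-word label such as OPERATOR NEWELL from the question, or an empty turn such as CALLER:, would not be recognized. The following is a minimal sketch of one possible adjustment, not a definitive fix. It assumes that a speaker label is whatever sits between the start of a line and the first colon on that line; the function name extract_utterances is made up for illustration.

```python
import re

# Assumption: a speaker label is any text from the start of a line up to
# the first colon on that line (so labels may contain spaces).
PATTERN = re.compile(
    r'^[^:\n]+:[ \t]*((?:(?!^[^:\n]+:).)*)',
    flags=re.DOTALL | re.MULTILINE,
)

def extract_utterances(text):
    """Return the text of each turn, stripped of surrounding whitespace."""
    return [utterance.strip() for utterance in PATTERN.findall(text)]

text = "CRO: Bye.\nCALLER:\nOPERATOR NEWELL: blah blah.\nsecond line\n"
print(extract_utterances(text))
# ['Bye.', '', 'blah blah.\nsecond line']
```

The trade-off is that any continuation line containing a colon would now be mistaken for a new speaker, so this only suits transcripts where that does not happen.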

Answer 2

You can use a lookahead expression that looks for the same pattern of a name at the start of a line followed by a colon:

s = '''CRO: How far are you from the World Trade Center, how many blocks, about? Three or four blocks?
63FDNY 911 Calls Transcript - EMS - Part 1 9-11-01
CALLER:
CRO: You're welcome. Thank you.
OPERATOR: Bye.
CRO: Bye.
RECORDER: The preceding portion of tape concludes at 0913 hours, 36 seconds.
This tape will continue on side B.
OPERATOR NEWELL: blah blah.
GUY IN DESK: I speak words!'''
import re
from pprint import pprint
pprint(re.findall(r'^([^:\n]+):\s*(.*?)(?=^[^:\n]+?:|\Z)', s, flags=re.MULTILINE | re.DOTALL), width=200)

This outputs:

[('CRO', 'How far are you from the World Trade Center, how many blocks, about? Three or four blocks?\n63FDNY 911 Calls Transcript - EMS - Part 1 9-11-01\n'),
 ('CALLER', ''),
 ('CRO', "You're welcome. Thank you.\n"),
 ('OPERATOR', 'Bye.\n'),
 ('CRO', 'Bye.\n'),
 ('RECORDER', 'The preceding portion of tape concludes at 0913 hours, 36 seconds.\nThis tape will continue on side B.\n'),
 ('OPERATOR NEWELL', 'blah blah.\n'),
 ('GUY IN DESK', 'I speak words!')]
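Since the question asks only for the sentences, the (name, sentence) tuples can be reduced to a plain list. The sketch below reuses the same pattern and, as an assumption about what is wanted, strips trailing newlines and drops empty turns such as CALLER:. The helper name sentences_only is hypothetical.

```python
import re

# Same pattern as above: group 1 is the name, group 2 is the utterance.
PATTERN = re.compile(
    r'^([^:\n]+):\s*(.*?)(?=^[^:\n]+?:|\Z)',
    flags=re.MULTILINE | re.DOTALL,
)

def sentences_only(transcript):
    """Return the stripped utterance of every non-empty turn."""
    sentences = []
    for _name, utterance in PATTERN.findall(transcript):
        utterance = utterance.strip()
        if utterance:  # assumption: empty turns are not wanted
            sentences.append(utterance)
    return sentences

sample = "CRO: You're welcome. Thank you.\nCALLER:\nOPERATOR: Bye.\n"
print(sentences_only(sample))
# ["You're welcome. Thank you.", 'Bye.']
```

Lines that are not speech, such as the transcript header in the mock data, still end up attached to the preceding utterance, because nothing in the format distinguishes them from a continuation line.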
