Add character names and their lines to a new dictionary from array / list

Question

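I have a movie script. My first task is to collect each character's lines into a dictionary.

Later I need to put the data into a series.

Right now I have all the dialogue in a list, with each entry starting with the character's name. It is formatted like this:

Dialogue[0] 'NAME1\n(16 spaces)YO, YO, good that you're here man.'

All names end with \n, and every dialogue line starts with 16 spaces. I think that could be useful, but I'm not sure how to use it.

I have tried a number of approaches, with little luck.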

    result = {}
    for lines in dialogue:
        first_token = para.split()[0]
        if first_token.endswith('\n'): #this would be the name
            name, line = para.split(on the new line?)
            name = name.strip()
            if name not in result:
                result[name] = []
            result[name].append(line)
    return result

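This code gives me a whole bunch of errors, so I don't think listing them here would be useful.

Ideally, I need each character as a key in the dictionary, with all of their lines as the value.

Like this:

NAME1: [line 1, line 2, line 3...]
NAME2: [line 1, line 2, line 3...]

Edit: some character names consist of two words.

Edit 2: Maybe it would be easier to go back to the original movie script text file.

It is formatted like this: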

          NAME1
Yo, Yo, good that you're here
man.

          NAME2
     (Laughing)
I don't think that's good!  We were
at the club, smoking, laughing -- doing
stuff.

Answer 1

Edited answer: going back to your original file, if we can assume that every character name is preceded by 22 whitespace characters, we can do this:

example = """
                      NAME1
            Yo, Yo, good that you're here
            man.

                      NAME2
                 (Laughing)
            I don't think that's good!  We were
            at the club, smoking, laughing -- doing
            stuff.
"""

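lines = example.split('\n')
# Lines indented by at least 22 spaces are treated as character names
characters = [line for line in lines if line.startswith(' ' * 22)]
result = {c.strip(): [] for c in characters}
current = ''
for line in lines:
    if line in characters:
        # A name line: the lines that follow belong to this character
        current = line.strip()
    elif current:
        # Dialogue (or blank) line: attach it to the current character
        result[current].append(line.strip())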

result is now:

{'NAME1': ["Yo, Yo, good that you're here", 'man.', ''], 'NAME2': ['(Laughing)', "I don't think that's good!  We were", 'at the club, smoking, laughing -- doing', 'stuff.', '']}

This may need some additional cleanup.
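One possible cleanup, as a minimal sketch: drop the empty strings and, optionally, the parenthesised stage directions. The helper name `clean` and the decision to discard stage directions are assumptions made for illustration, not part of the approach above.

```python
def clean(result, drop_directions=True):
    """Return a copy of result without empty lines (and, optionally, stage directions)."""
    cleaned = {}
    for name, lines in result.items():
        kept = []
        for line in lines:
            if not line:
                continue  # skip blank lines left over from the script layout
            if drop_directions and line.startswith('(') and line.endswith(')'):
                continue  # skip stage directions such as "(Laughing)"
            kept.append(line)
        cleaned[name] = kept
    return cleaned


# Same structure as the result shown above.
result = {
    'NAME1': ["Yo, Yo, good that you're here", 'man.', ''],
    'NAME2': ['(Laughing)', "I don't think that's good!  We were",
              'at the club, smoking, laughing -- doing', 'stuff.', ''],
}
cleaned = clean(result)
```

Pass `drop_directions=False` if the stage directions should be kept alongside the spoken lines.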

Answer 2

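Method 1:

Split on '\n' and strip each piece. The first element of the list is the name, and the rest are your lines. list.pop modifies your list in place. If your dialogue has multiple lines, this solution will not work.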

>>> dialogue
'NAME1\n                abc adbaiuho saidainbw\n                sadi waiudi qoweoq asodhoqndoqndqwdq.\n                qiudwqd aisdiqnd asfiqwofnqofoweqomdomkmq!!'
>>> lines = list(map(str.strip, dialogue.split('\n')))
>>> lines
['NAME1', 'abc adbaiuho saidainbw', 'sadi waiudi qoweoq asodhoqndoqndqwdq.', 'qiudwqd aisdiqnd asfiqwofnqofoweqomdomkmq!!']
>>> name = lines.pop(0)
>>> name
'NAME1'
>>> lines
['abc adbaiuho saidainbw', 'sadi waiudi qoweoq asodhoqndoqndqwdq.', 'qiudwqd aisdiqnd asfiqwofnqofoweqomdomkmq!!']

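Method 2:

When you have multi-line dialogue, i.e. the dialogue may contain '\n' characters, first split on the first occurrence of '\n'. The first element will be the name; the remaining element is then split further on the 16 spaces.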

>>> dialogue
'NAME1\n                abc adbaiuho saidainbw\n                sadi waiudi qoweoq asodhoqndoqndqwdq.\n                qiudwqd aisdiqnd asfiqwofnqofoweqomdomkmq!!'
>>> parse_temp = dialogue.split('\n',1)
>>> name = parse_temp[0]
>>> lines = parse_temp[1].split(" " * 16)[1:]
>>> name
'NAME1'
>>> lines
['abc adbaiuho saidainbw\n', 'sadi waiudi qoweoq asodhoqndoqndqwdq.\n', 'qiudwqd aisdiqnd asfiqwofnqofoweqomdomkmq!!']

As a function:

def parse(dialogue):
    parse_temp = dialogue.split('\n',1)
    name = parse_temp[0].strip()
    lines = list(map(str.strip, parse_temp[1].split(" " * 16)[1:]))
    return name, lines

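Note: for the second split, you can substitute whatever whitespace pattern you actually have. You can even split with a regular expression. I simply used 16 spaces here.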
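For instance, a regular expression can split on a newline followed by any amount of indentation, so the exact count of 16 spaces no longer matters. This is only a sketch under that assumption, and `parse_regex` is a hypothetical name.

```python
import re


def parse_regex(dialogue):
    """Split one entry into (name, lines), tolerating any indentation width."""
    name, _, body = dialogue.partition('\n')
    # Split wherever a newline is followed by optional spaces or tabs.
    parts = [part.strip() for part in re.split(r'\n[ \t]*', body)]
    # Drop empty pieces, e.g. from a trailing newline or a missing body.
    return name.strip(), [part for part in parts if part]


name, lines = parse_regex("NAME 1\n                abc def\n        ghi jkl!!")
```

Here the two dialogue lines are indented by different amounts, and both are still recovered as separate, stripped strings.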

Code added on request, to iterate over the whole list:

data = dict()
for _dialogue in dialogue:
   name, lines = parse(_dialogue)
   data[name] = data.get(name, list()) + lines
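Putting the pieces together, a self-contained run might look like the following. The sample entries are made up for illustration, and `dict.setdefault` with `extend` is a swapped-in alternative to `data.get(...) + lines` that avoids building a new list on every iteration.

```python
def parse(dialogue):
    # Mirrors the function above: name first, then lines split on 16 spaces.
    head, _, body = dialogue.partition('\n')
    lines = [part.strip() for part in body.split(' ' * 16)[1:]]
    return head.strip(), lines


# Hypothetical sample data in the format described in the question.
dialogue = [
    "NAME1\n" + " " * 16 + "first line\n" + " " * 16 + "second line",
    "NAME 2\n" + " " * 16 + "hello",
    "NAME1\n" + " " * 16 + "third line",
]

data = {}
for entry in dialogue:
    name, lines = parse(entry)
    # Create the list on first sight of a name, then append this entry's lines.
    data.setdefault(name, []).extend(lines)
```

Lines spoken by the same character in different entries end up in one list, in the order they appear in the script.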

Answer 3

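Split the text lines
Create a dictionary with a unique key for each actor
Add the actor's lines to the dictionary

Edit: added a space to the name regex, and stripped whitespace from names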

import re
lines = [
    "Dialogue[0] 'NAME1 \n                YO, YO, good that you're here man.'",
    "Dialogue[1] 'NAME 1\n                YO, YO, ",
    "Dialogue[2] 'NAME2\n                YO, YO, good that ",
    "Dialogue[3] 'NAME2\n                YO, YO, good that you're here'",
]

# Name (capitals, digits, spaces), a newline, exactly 16 spaces, then the line text
regex = re.compile("'([A-Z 0-9]+)\n[ ]{16}(.+)")
matches = [re.findall(regex, line) for line in lines]
lineslist = [match[0] for match in matches if match]
keys = [l[0].strip() for l in lineslist]
result = {k: [] for k in set(keys)}
for name, text in lineslist:
    result[name.strip()].append(text)
result

Output:

{'NAME 1': ['YO, YO, '],
 'NAME1': ["YO, YO, good that you're here man.'"],
 'NAME2': ['YO, YO, good that ', "YO, YO, good that you're here'"]}
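Note that `.` does not match a newline by default, so `(.+)` only captures the first line of each speech. A hedged variant, assuming the list holds the raw dialogue strings (without the `Dialogue[n] '` prefix) and that names consist of capitals, digits and spaces, could use `re.DOTALL` to capture the whole speech. The names `pattern`, `collect` and `entries` are hypothetical.

```python
import re

# re.DOTALL lets "." match newlines, so the second group spans the whole speech.
# The lazy name group plus "[ ]*" keeps trailing spaces out of the name.
pattern = re.compile(r"([A-Z][A-Z 0-9]*?)[ ]*\n(.+)", re.DOTALL)


def collect(entries):
    result = {}
    for entry in entries:
        match = pattern.match(entry)
        if match is None:
            continue  # entry does not look like "NAME\n<lines>"
        name, body = match.groups()
        # One list item per non-empty physical line, indentation removed.
        lines = [line.strip() for line in body.splitlines() if line.strip()]
        result.setdefault(name, []).extend(lines)
    return result


entries = [
    "NAME1 \n                YO, YO, good that\n                you're here man.",
    "NAME 1\n                YO, YO,",
    "NAME1\n                Second speech.",
]
result = collect(entries)
```

As in the output above, 'NAME1' and 'NAME 1' are treated as different characters, since only trailing whitespace is stripped from the name.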
