I'm a complete beginner! How do I convert a .txt file (film script) to a table (characters and lines) in R or Python?

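I'm a complete beginner and, for a university project, I need to analyse film scripts. I want to create a table in which I can match each character with their lines. My files are all in .txt format and I'd like to convert them to csv files. I have a lot of scripts to process, so I'm looking for code that can easily be adapted to different files.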

This is what I have:

                            THREEPIO
      Did you hear that?  They've shut 
      down the main reactor.  We'll be 
      destroyed for sure.  This is 
      madness!


                THREEPIO
      We're doomed!


                THREEPIO
      There'll be no escape for the 
      Princess this time.

                THREEPIO
      What's that?

This is what I need:

"character" "dialogue"

"1" "THREEPIO" "Did you hear that? They've shut down the main reactor. We'll be destroyed for sure. This is madness!"

"2" "THREEPIO" "We're doomed!"

"3" "THREEPIO" "There'll be no escape for the Princess this time."

"4" "THREEPIO" "What's that?"

This is what I tried:

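# setup the loop relies on (assumed here: the input file name is a placeholder,
# and b15 / b30 are taken to be strings of 15 and 30 blanks)
sw = readLines("script.txt")
nlines = length(sw)
b15 = strrep(" ", 15)
b30 = strrep(" ", 30)

# the first 70 lines don't contain dialogues
# so we can start reading at line 70 (for instance)
i = 70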

# while loop to extract character and dialogues
# (probably there's a better way to parse the file instead of
# using my crazy nested if-then-elses, but this works for me)
while (i <= nlines)
{
  # if empty line
  if (sw[i] == "") {
    i = i + 1  # next line
    next       # re-check i <= nlines before reading sw[i] again
  }
  # if text line
  if (sw[i] != "")
  {
    # if uninteresting stuff
    if (substr(sw[i], 1, 1) != " ") {
      i = i + 1   # next line
    } else {
      if (nchar(sw[i]) < 10) {
        i = i + 1  # next line
      } else {
        if (substr(sw[i], 1, 5) != " " && substr(sw[i], 6, 6) != " ") {
          i = i + 1  # next line
        } else {
          # if character name
          if (substr(sw[i], 1, 30) == b30) 
          {
            if (substr(sw[i], 31, 31) != " ")
            {
              tmp_name = substr(sw[i], 31, nchar(sw[i], "bytes"))
              cat("\n", file="EpisodeVI_dialogues.txt", append=TRUE)
              cat(tmp_name, "", file="EpisodeVI_dialogues.txt", sep="\t", append=TRUE)
              i = i + 1        
            } else {
              i = i + 1
            }
          } else {
            # if dialogue
            if (substr(sw[i], 1, 15) == b15)
            {
              if (substr(sw[i], 16, 16) != " ")
              {
                tmp_diag = substr(sw[i], 16, nchar(sw[i], "bytes"))
                cat("", tmp_diag, file="EpisodeVI_dialogues.txt", append=TRUE)
                i = i + 1
              } else {
                i = i + 1
              }
            } else {
              i = i + 1  # neither a name nor a dialogue line: move on so the loop cannot stall
            }
          }
        }
      }
    }    
  }
}

Any help would be much appreciated! Thank you!!

You can do it like this:

text = """
 THREEPIO
      Did you hear that?  They've shut 
      down the main reactor.  We'll be 
      destroyed for sure.  This is 
      madness!


                THREEPIO
      We're doomed!


                THREEPIO
      There'll be no escape for the 
      Princess this time.

                THREEPIO
      What's that?
"""

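# split on any whitespace, so line breaks and double spaces disappear
clean = text.split()

n = 1
tmp = []
results = []
for element in clean:
    # an all-caps word is treated as a character name and starts a new row
    # (caveat: an all-caps word inside the dialogue, such as "I", also matches)
    if element.isupper():
        if tmp:
            results.append(tmp)
        tmp = [n, element]
        n += 1
        continue
    try:
        tmp[2] = " ".join((tmp[2], element))
    except IndexError:
        tmp.append(element)

# store the last row, which is never appended inside the loop
if tmp:
    results.append(tmp)

print(results)

Result:

[[1, 'THREEPIO', "Did you hear that? They've shut down the main reactor. We'll be destroyed for sure. This is madness!"], [2, 'THREEPIO', "We're doomed!"], [3, 'THREEPIO', "There'll be no escape for the Princess this time."], [4, 'THREEPIO', "What's that?"]]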
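
Since the goal is a csv file, rows shaped like the ones in this result can be written out with Python's standard csv module. This is only a minimal sketch: it assumes each row is [number, character, dialogue], and the function name rows_to_csv, the column name "id" and the file name in the comment are made up for illustration.

```python
import csv
import io


def rows_to_csv(rows, handle):
    """Write [number, character, dialogue] rows as CSV to an open text handle."""
    # QUOTE_NONNUMERIC puts quotes around every text field, so commas in the
    # dialogue cannot be mistaken for column separators
    writer = csv.writer(handle, quoting=csv.QUOTE_NONNUMERIC)
    writer.writerow(["id", "character", "dialogue"])
    writer.writerows(rows)


# sample rows in the same shape as the results list
rows = [
    [1, "THREEPIO", "We're doomed!"],
    [2, "THREEPIO", "What's that?"],
]

# to write a real file instead (hypothetical file name):
# with open("dialogues.csv", "w", newline="", encoding="utf-8") as f:
#     rows_to_csv(rows, f)

# here the CSV text goes to an in-memory buffer so it can be inspected
buffer = io.StringIO()
rows_to_csv(rows, buffer)
print(buffer.getvalue())
```

Opening the file with newline="" is what the csv module documentation recommends, so that the writer controls the line endings itself.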

If you know the list of character names (and aren't worried about typos), then something like this will work:

script = """
 THREEPIO
      Did you hear that?  They've shut 
      down the main reactor.  We'll be 
      destroyed for sure.  This is 
      madness!


                THREEPIO
      We're doomed!


                THREEPIO
      There'll be no escape for the 
      Princess this time.

                THREEPIO
      What's that?
"""

characters = ['THREEPIO', 'ANAKIN']
# strip every line and drop the empty ones
lines = [x.strip() for x in script.split('\n') if x.strip()]
results = []
for (i, item) in enumerate(lines):
    if item in characters:
        # collect the following lines until the next character name
        dialogue = []
        for index in range(i + 1, len(lines)):
            if lines[index] in characters:
                break
            dialogue.append(lines[index])
        results.append([item, ' '.join(dialogue)])

# number the rows starting at 1
print(list(enumerate(results, start=1)))

This prints:

[(1, ['THREEPIO', "Did you hear that?  They've shut down the main reactor.  We'll be destroyed for sure.  This is madness!"]), (2, ['THREEPIO', "We're doomed!"]), (3, ['THREEPIO', "There'll be no escape for the Princess this time."]), (4, ['THREEPIO', "What's that?"])]
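
If the character names are not known in advance, the indentation may be able to do the work instead, much as the R attempt in the question does: in the sample, names are indented further than dialogue. The following is a sketch under that assumption only. The function name parse_script, the two thresholds (10 and 4 spaces) and the folder name in the comment are hypothetical and would need adjusting for each script; it also assumes the files are indented with spaces, not tabs.

```python
def parse_script(text, name_indent=10, dialogue_indent=4):
    """Return [character, dialogue] pairs, telling names from dialogue by indentation."""
    rows = []
    for raw in text.splitlines():
        stripped = raw.strip()
        if not stripped:
            continue  # blank line
        # number of leading spaces on this line
        indent = len(raw) - len(raw.lstrip(" "))
        if indent >= name_indent:
            # deeply indented line: start a new row for this character
            rows.append([stripped, []])
        elif indent >= dialogue_indent and rows:
            # moderately indented line: dialogue for the latest character
            rows[-1][1].extend(stripped.split())
        # anything else (for example text at the left margin) is ignored
    return [[name, " ".join(words)] for name, words in rows]


sample = """
                THREEPIO
      We're doomed!

                THREEPIO
      There'll be no escape for the
      Princess this time.
"""

rows = parse_script(sample)
print(rows)

# for many files, the same function could be applied in a loop, for instance
# (hypothetical folder name):
# import pathlib
# for path in pathlib.Path("scripts").glob("*.txt"):
#     rows = parse_script(path.read_text(encoding="utf-8"))
```

Because the thresholds are parameters, a script with a different layout only needs different values for name_indent and dialogue_indent, not different code.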