Splitting Macbeth When a Character Speaks

After sending a GET request to Project Gutenberg, I have the full text of the play Macbeth as a string:

import requests

response = requests.get('https://www.gutenberg.org/cache/epub/2264/pg2264.txt')
full_text = response.text
macbeth = full_text[16648:]

I split it into words:

words_raw = macbeth.split()
word_count = len(words_raw)

print("Macbeth contains {} words".format(word_count))
print("Here are some examples:", words_raw[400:460])

Then I stripped all punctuation and converted each string to lowercase with lower():

import string
punctuation = string.punctuation

words_cleaned = []

for word in words_raw:
    # remove punctuation
    word = word.strip(punctuation)
    # make lowercase
    word = word.lower()
    words_cleaned.append(word)

print("Cleaned word examples:", words_cleaned[400:460])

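However, I can't strip all of the punctuation, because I need the period after each name (or shortened name) as an indicator that a character is about to speak.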

Excerpt from the course

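A speaking character is indicated by a (usually abbreviated) version of their name followed by a . (period) as the first thing on a line. So, for example, when Macbeth speaks, the line starts with "Macb." You will need to modify how you handle punctuation, since you cannot just remove all of it.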
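The rule in the excerpt can be sketched as a small line-level check. This is only a rough illustration: the pattern `SPEAKER_RE` and the helper `split_speaker` are hypothetical names, the two-space indentation is an assumption about the source file, and the pattern would also accept stage directions such as "Exeunt." and would miss the numbered witches ("1.", "2.").

```python
import re

# Hypothetical pattern: optional indentation, a capitalized (possibly
# abbreviated) name, a period, whitespace, then the spoken text.
SPEAKER_RE = re.compile(r"^\s*([A-Z][A-Za-z]*)\.\s+(\S.*)$")

def split_speaker(line):
    """Return (speaker, speech) if the line opens with 'Name.', else (None, text)."""
    match = SPEAKER_RE.match(line)
    if match:
        return match.group(1), match.group(2)
    return None, line.strip()

# Sample lines taken from the opening of the play; indentation is assumed.
sample_lines = [
    "  King. What bloody man is that? he can report,",
    "As seemeth by his plight, of the Reuolt",
    "  Mal. This is the Serieant,",
]
for line in sample_lines:
    print(split_speaker(line))
```

Working on whole lines rather than on the flat word list keeps the "first thing on a line" condition available, which is lost after macbeth.split().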

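Slice of the raw data after split()

Names followed by a period (for example 'King.', 'Mal.', 'Lenox.', 'Rosse.') mark the speaker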

Macbeth contains 17737 words
Here are some examples: ['Gashes', 'cry', 'for', 'helpe', 'King.', 'So', 'well', 'thy', 'words', 'become', 'thee,', 'as', 'thy', 'wounds,', 'They', 'smack', 'of', 'Honor', 'both:', 'Goe', 'get', 'him', 'Surgeons.', 'Enter', 'Rosse', 'and', 'Angus.', 'Who', 'comes', 'here?', 'Mal.', 'The', 'worthy', 'Thane', 'of', 'Rosse', 'Lenox.', 'What', 'a', 'haste', 'lookes', 'through', 'his', 'eyes?', 'So', 'should', 'he', 'looke,', 'that', 'seemes', 'to', 'speake', 'things', 'strange', 'Rosse.', 'God', 'saue', 'the', 'King', 'King.']

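We know that 'Malcolm' is speaking when his name is followed by a period ('Mal.' above), and the same goes for 'Lenox' when he starts speaking ('Lenox.'). Sometimes a character's name is shortened; in other cases the full name is used, immediately followed by a period.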

List of the most common names in Macbeth

["duncan", "malcolm", "donalbaine", "macbeth", "banquo", "macduff", "lenox", "rosse", "menteth", "angus", "cathnes", "fleance", "seyward", "seyton", "boy", "lady", "messenger", "wife"]

Goal

Here is what I have tried so far

Attempting to isolate the non-alphanumeric characters

print(len(words_raw))
def extra(string):
    return list(c for c in string if not c.isalnum() and not c.isspace())
weird = extra(macbeth)
weird

discard = []
for char in weird:
    if char != '.':
        discard.append(char)
print(len(weird))
print(len(discard))
print(discard)

revised_macbeth = []

for character in words_raw:
    if not character in discard:
        revised_macbeth.append(character)
print(len(revised_macbeth))

# for character in words_raw:
#     if not character.isalnum():
#         print("found: \'{}\'".format(character))

Its output

17737
4788
3553
['?', ',', ',', '?', '-', "'", ',', "'", ',', '?', ',', '-', ':', ',', ',', ',', ',', ',', ',', ',', '?', ',', ',', ',', "'", ':', ';', ',', ',', ',', ',', ',', ':', '(', ',', ')', "'", ',', ',', "'", ':', "'", ':', '(', ')', ',', ',', "'", '(', ')', "'", ',', "'", ':', "'", ',', ',', "'", "'", ',', "'", ',', "'", ',', ',', ':', ',', "'", ',', ':', ',', ',', ',', "'", ',', "'", ',', ',', ',', ',', ',', "'", ',', '?', ',', ',', ';', ',', ':', ',', '-', "'", ',', ':', ',', ',', ':', ',', ',', ',', ':', '?', '?', ',', "'", ',', '?', ',', ',', ',', ',', ',', ',', ',', ',', "'", ',', ',', '-', ',', ',', "'", ',', ':', ',', ',', ',', ':', ',', ',', ',', ',', ':', ',', ',', ',', '?', ',', '?', ',', ',', '&', ',', ':', ',', ',', ',', '-', "'", ',', "'", "'", ':', ',', ',', ',', ',', "'", ',', ',', ',', "'", "'", '-', ':', '-', ':', ':', "'", ',', ',', ',', ',', ':', ',', '-', ',', ',', ',', ',', ':', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', "'", "'", "'", '?', ',', "'", ',', ',', "'", "'", "'", ',', "'", '?', ',', '?', ',', ':', ',', ':', '?', ',', ',', ',', ',', ',', '?', "'", "'", ',', '?', ',', ',', ',', ':', ',', ',', ',', ',', 

Comparison

print(macbeth)
The Tragedie of Macbeth

Actus Primus. Scoena Prima.

Thunder and Lightning. Enter three Witches.

  1. When shall we three meet againe?
In Thunder, Lightning, or in Raine?
  2. When the Hurley-burley's done,
When the Battaile's lost, and wonne

   3. That will be ere the set of Sunne

   1. Where the place?
  2. Vpon the Heath

   3. There to meet with Macbeth

   1. I come, Gray-Malkin
print(revised_macbeth)
['The', 'Tragedie', 'of', 'Macbeth', 'Actus', 'Primus.', 'Scoena', 'Prima.', 'Thunder', 'and', 'Lightning.', 'Enter', 'three', 'Witches.', '1.', 'When', 'shall', 'we', 'three', 'meet', 'againe?', 'In', 'Thunder,', 'Lightning,', 'or', 'in', 'Raine?', '2.', 'When', 'the', "Hurley-burley's", 'done,', 'When', 'the', "Battaile's", 'lost,', 'and', 'wonne', '3.', 'That', 'will', 'be', 'ere', 'the', 'set', 'of', 'Sunne', '1.', 'Where', 'the', 'place?', '2.', 'Vpon', 'the', 'Heath', '3.', 'There', 'to', 'meet', 'with', 'Macbeth', '1.', 'I', 'come,', 'Gray-Malkin', 'All.', 'Padock', 'calls', 'anon:', 'faire', 'is', 'foule,', 'and', 'foule', 'is', 'faire,', 'Houer', 'through', 'the', 'fogge', 'and', 'filthie', 'ayre.', 'Exeunt.', 'Scena', 'Secunda.', 'Alarum', 'within.', 'Enter', 'King,', 'Malcome,', 'Donalbaine,', 'Lenox,', 'with', 'attendants,', 'meeting', 'a', 'bleeding', 'Captaine.', 'King.', 'What', 'bloody', 'man', 'is', 'that?', 'he', 'can', 'report,', 'As', 'seemeth', 'by', 'his', 'plight,', 'of', 'the', 'Reuolt', 'The', 'newest', 'state', 'Mal.', 'This', 'is', 'the', 'Serieant,', 'Who', 'like', 'a', 'good', 'and', 'hardie', 'Souldier', 'fought', "'Gainst", 'my', 'Captiuitie:', 'Haile', 'braue', 'friend;', 'Say', 'to', 'the', 'King,', 'the', 'knowledge', 'of', 'the', 'Broyle,', 'As', 'thou', 'didst', 'leaue', 'it', 'Cap.', 'Doubtfull', 'it', 'stood,', 'As', 'two', 'spent', 'Swimmers,', 'that', 'doe', 'cling', 'together,', 'And', 'choake', 'their', 'Art:', 'The', 'me

Following up on my comment above:

You might have an easier time of it if you split into lines first, and then split into words, because I expect the abbreviated character names will always be at the start of a line? Also, I notice the line is indented a couple spaces when a new character starts speaking. That could be another thing to look for.

Split into lines:

macbeth_lines = macbeth.split('\r\n') # Because in your text lines are separated by \r\n
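If you are not sure which line ending your copy of the text uses, str.splitlines() is a more forgiving alternative, since it splits on \r\n, \n and \r alike (and on a few rarer boundary characters). A minimal sketch with a made-up sample string that mixes both endings:

```python
# Sample text with one \r\n and one bare \n, purely for illustration
sample = (
    "  1. When shall we three meet againe?\r\n"
    "In Thunder, Lightning, or in Raine?\n"
    "  2. When the Hurley-burley's done,"
)

lines_fixed = sample.split("\r\n")  # splits only where \r\n occurs
lines_any = sample.splitlines()     # splits on \r\n, \n and \r

print(len(lines_fixed))
print(len(lines_any))
```

Either way, leading spaces on each line are preserved, which matters for the indentation check that follows.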

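Then, iterate over each line. If it starts with a space, strip all punctuation except the period from the first word, and strip all punctuation from the remaining words. If it doesn't start with a space, strip all punctuation from every word. To remove the characters we'll use str.translate(), which takes a dict mapping each input character's code point to its translated output. We can build that dict so that every punctuation character maps to None, which deletes it.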

import string

# Create a dictionary for str.translate: code point -> None deletes the character
strip_chars = {ord(punct): None for punct in string.punctuation}

# And one without the period
strip_chars_no_period = {k: v for k, v in strip_chars.items() if k != ord('.')}

macbeth_words = []
for line in macbeth_lines:
    line_words = line.split()
    line_proc_words = [] # List to see each line as it's processed
                         # Remove if not needed

    if line.startswith(" ") and line_words:
        # this line starts with a space and is not blank. Maybe it contains a name

        # Don't strip periods from the first word
        first_word = line_words[0].translate(strip_chars_no_period)

        line_proc_words.append(first_word) # Debug line

        # Save the word
        macbeth_words.append(first_word)

        # Remaining words yet to be processed in this line
        remaining_words = line_words[1:]
    else:
        # All words in the line are yet to be processed
        remaining_words = line_words

    # Process remaining words
    for other_word in remaining_words:
        # Strip punctuation
        stripped_word = other_word.translate(strip_chars)

        line_proc_words.append(stripped_word) # Debug line

        # Save to list
        macbeth_words.append(stripped_word)
    
    # Print out the line just to make sure it's correct
    print(' '.join(line_proc_words)) # Debug line

I added a line_proc_words list so that we can print each line as it is processed. The output of the code above (I ran it on the first 100 lines only) looks like this:

The Tragedie of Macbeth

Actus Primus Scoena Prima

Thunder and Lightning Enter three Witches

1. When shall we three meet againe
In Thunder Lightning or in Raine
2. When the Hurleyburleys done
When the Battailes lost and wonne

3. That will be ere the set of Sunne

1. Where the place
2. Vpon the Heath

3. There to meet with Macbeth

1. I come GrayMalkin

All. Padock calls anon faire is foule and foule is faire
Houer through the fogge and filthie ayre

Exeunt


Scena Secunda

Alarum within Enter King Malcome Donalbaine Lenox with
attendants meeting a bleeding Captaine

King. What bloody man is that he can report
As seemeth by his plight of the Reuolt
The newest state

Mal. This is the Serieant
Who like a good and hardie Souldier fought
Gainst my Captiuitie Haile braue friend
Say to the King the knowledge of the Broyle
As thou didst leaue it

Cap. Doubtfull it stood
As two spent Swimmers that doe cling together
And choake their Art The mercilesse Macdonwald
Worthie to be a Rebell for to that
The multiplying Villanies of Nature
Doe swarme vpon him from the Westerne Isles
Of Kernes and Gallowgrosses is supplyd
And Fortune on his damned Quarry smiling
Shewd like a Rebells Whore but alls too weake
For braue Macbeth well hee deserues that Name
Disdayning Fortune with his brandisht Steele
Which smoakd with bloody execution
Like Valours Minion carud out his passage
Till hee facd the Slaue
Which neur shooke hands nor bad farwell to him
Till he vnseamd him from the Naue toth Chops
And fixd his Head vpon our Battlements

King. O valiant Cousin worthy Gentleman

Cap. As whence the Sunne gins his reflection
Shipwracking Stormes and direfull Thunders
So from that Spring whence comfort seemd to come
Discomfort swells Marke King of Scotland marke
No sooner Iustice had with Valour armd
Compelld these skipping Kernes to trust their heeles
But the Norweyan Lord surueying vantage
With furbusht Armes and new supplyes of men
Began a fresh assault

King. Dismayd not this our Captaines Macbeth and
Banquoh
Cap. Yes as Sparrowes Eagles
Or the Hare the Lyon
If I say sooth I must report they were
As Cannons ouerchargd with double Cracks
So they doubly redoubled stroakes vpon the Foe
Except they meant to bathe in reeking Wounds
Or memorize another Golgotha
I cannot tell but I am faint
My Gashes cry for helpe

King. So well thy words become thee as thy wounds
They smack of Honor both Goe get him Surgeons
Enter Rosse and Angus

Who comes here
Mal. The worthy Thane of Rosse

Lenox. What a haste lookes through his eyes
So should he looke that seemes to speake things strange

Rosse. God saue the King

King. Whence camst thou worthy Thane
Rosse. From Fiffe great King
Where the Norweyan Banners flowt the Skie
And fanne our people cold
Norway himselfe with terrible numbers
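The same per-line rule can also be wrapped in a small function, which makes it easy to check on single lines. This is a sketch, not a replacement for the loop above: `clean_line` is a hypothetical name, and it builds the deletion tables with str.maketrans instead of a hand-written dict.

```python
import string

# str.maketrans("", "", chars) builds a table that deletes every character in chars
delete_all = str.maketrans("", "", string.punctuation)
delete_all_but_period = str.maketrans("", "", string.punctuation.replace(".", ""))

def clean_line(line):
    """Strip punctuation from a line, keeping periods on the first word of indented lines."""
    words = line.split()
    if line.startswith(" ") and words:
        first = words[0].translate(delete_all_but_period)
        rest = [word.translate(delete_all) for word in words[1:]]
        return [first] + rest
    return [word.translate(delete_all) for word in words]

print(clean_line("  Mal. This is the Serieant,"))
print(clean_line("'Gainst my Captiuitie: Haile braue friend;"))
```

The first call keeps 'Mal.' intact while dropping the trailing comma; the second, unindented line loses all of its punctuation.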

You can use collections.defaultdict to group the lines by speaker name. enumerate can be used to get the line number of each line a character speaks:

import requests, re
from collections import defaultdict
r = requests.get('https://www.gutenberg.org/cache/epub/2264/pg2264.txt').text
d, l, keywords = defaultdict(list), None, ['Enter', 'Exit', 'Flourish', 'Thunder']
#iterate over the play lines, ignoring empty strings (generated from the split)
for i, a in filter(lambda x:x[-1], enumerate(re.split('[\n\r]+', r[r.index('Actus Primus. Scoena Prima.')+27:]))):
   #check that the line contains character dialog, not stage prompts
   if not re.findall('|'.join(keywords), a):
      #grab the name of the character and append to "d"
      if (n:=re.findall(r'^\s+[A-Z](?:\.[A-Z])*[a-z]+\.(?=\s\w+)|^[A-Z](?:\.[A-Z])*[a-z\.]+\.(?=\s\w+)', a)):
         d[(l:=re.sub(r'^\s+|\.$', '', n[0]).lower())].append((i, a[len(n[0])+1:].lower()))
      elif l:
         #the line might be a continuation of a larger block of character text
         d[l].append((i, a.lower()))

print(list(d.keys())) #detected characters
print(d['macb'][:10]) #first ten occurrences of Macbeth speaking

Output:

['all', 'king', 'mal', 'cap', 'lenox', 'rosse', 'macb', 'banquo', 'mac', 'banq', 'ang', 'lady', 'mess', 'la', 'fleance', 'porter', 'macd', 'port', 'exeunt', 'ban', 'donal', 'malc', 'don', 'ross', 'seruant', 'murth', 'lords', 'mur', 'len', 'hec', 'lord', 'appar', 'musicke', 'wife', 'son', 'mes', 'doct', 'ro', 'gent', 'lad', 'ment', 'cath', 'ser', 'sey', 'seyw', 'sold', 'syw', 'y.sey']
[(137, 'so foule and faire a day i haue not seene'), (170, 'stay you imperfect speakers, tell me more:'), (171, 'by sinells death, i know i am thane of glamis,'), (172, 'but how, of cawdor? the thane of cawdor liues'), (173, 'a prosperous gentleman: and to be king,'), (174, 'stands not within the prospect of beleefe,'), (175, 'no more then to be cawdor. say from whence'), (176, 'you owe this strange intelligence, or why'), (177, 'vpon this blasted heath you stop our way'), (178, 'with such prophetique greeting?')]
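To see the grouping idea in isolation, without the regular expressions, here is a stripped-down sketch on a few hard-coded lines. It assumes the lines are already cleaned so that a speaker prefix such as 'King.' appears only at the start of a speech; the names `speeches` and `current` are illustrative, and the prefix test is much cruder than the pattern above.

```python
from collections import defaultdict

# Already-cleaned sample lines; a prefix like 'King.' opens each speech
lines = [
    "King. What bloody man is that he can report",
    "As seemeth by his plight of the Reuolt",
    "The newest state",
    "Mal. This is the Serieant",
    "Who like a good and hardie Souldier fought",
    "King. O valiant Cousin worthy Gentleman",
]

speeches = defaultdict(list)
current = None  # speaker of the speech currently being read

for number, line in enumerate(lines):
    first, _, rest = line.partition(" ")
    # A first word made of letters plus a trailing period starts a new speech
    if first.endswith(".") and first[:-1].isalpha():
        current = first[:-1].lower()
        line = rest
    if current is not None:
        speeches[current].append((number, line.lower()))

print(dict(speeches))
```

Continuation lines are attributed to whoever spoke last, which is the same carry-over that the variable l provides in the code above.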

Edit: common words for each character:

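To find the common words for each character, iterate over that character's sentences in d, then iterate over the result of str.split on each sentence. Note that the result will contain many stop words. My solution below gives you the option to filter them out: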

from collections import Counter
def common_words(character, filter_stop = False, stop_words = []):
   if filter_stop:
      stop_words = set(filter(None, requests.get("https://gist.githubusercontent.com/sebleier/554280/raw/7e0e4a1ce04c2bb7bd41089c9821dbcf6d0c786c/NLTK's%2520list%2520of%2520english%2520stopwords").text.split('\n')))
   # the keys of d are lowercase speaker prefixes such as 'macb'
   w = [i for _, b in d[character.lower()] for i in re.sub(r'[\:\.\?]+', '', b).split() if i.lower() not in stop_words]
   return Counter(w).most_common(5)

print(common_words('Macb', filter_stop=True))

Output:

[('haue', 39), ('thou', 34), ('thy', 23), ('shall', 21), ('thee', 20)]
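The effect of the stop-word filter is easier to see offline on a handful of sentences. The sketch below is only an illustration: `top_words` and the tiny `STOP_WORDS` set are made up for the example and stand in for the downloaded NLTK list, and punctuation is left attached to the words.

```python
from collections import Counter

# A tiny hand-picked stop-word set, purely for illustration
STOP_WORDS = {"i", "a", "the", "and", "to", "of", "not", "so"}

def top_words(sentences, n=3, stop_words=frozenset()):
    """Count whitespace-separated words across sentences, skipping stop words."""
    words = [
        word
        for sentence in sentences
        for word in sentence.lower().split()
        if word not in stop_words
    ]
    return Counter(words).most_common(n)

# Three of Macbeth's lines from the output shown earlier
sample = [
    "so foule and faire a day i haue not seene",
    "by sinells death, i know i am thane of glamis,",
    "but how, of cawdor? the thane of cawdor liues",
]
print(top_words(sample, n=1))
print(top_words(sample, n=1, stop_words=STOP_WORDS))
```

Without the filter the most frequent token is a stop word; with it, 'thane' comes out on top for this sample.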