Splitting Macbeth When a Character Speaks
After sending a GET request to Project Gutenberg, I have the full play Macbeth as a string:
import requests
response = requests.get('https://www.gutenberg.org/cache/epub/2264/pg2264.txt')
full_text = response.text
macbeth = full_text[16648:]
I split it into words:
words_raw = macbeth.split()
word_count = len(words_raw)
print("Macbeth contains {} words".format(word_count))
print("Here are some examples:", words_raw[400:460])
Then I strip all punctuation and convert the strings with lower():
import string
punctuation = string.punctuation
words_cleaned = []
for word in words_raw:
    # remove leading and trailing punctuation
    word = word.strip(punctuation)
    # make lowercase
    word = word.lower()
    words_cleaned.append(word)
print("Cleaned word examples:", words_cleaned[400:460])
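However, I can't strip all punctuation, because I need the period after the names/shortened names as an indicator that a character is about to speak.
Excerpt from the course
The character speaking is indicated by a (usually abbreviated) version of their name followed by a . (period) as the first thing on a line. So, for example, when Macbeth speaks, the line starts with "Macb." You will need to modify how you handle punctuation, since you cannot simply remove all of it.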
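As a rough illustration of that rule, the sketch below checks whether a line begins with a capitalised word followed by a period. The sample lines, the SPEAKER pattern, and the speaker_of helper are all assumptions made for this example; they are not part of the course material, and the pattern will also match one-word stage directions such as "Exeunt." while missing the numbered witches ("1.").

```python
import re

# Hypothetical sample laid out like the play text:
# a speaker line starts with "Name." or "Abbrev." as its first word.
sample_lines = [
    "  King. What bloody man is that? he can report,",
    "As seemeth by his plight, of the Reuolt",
    "  Mal. This is the Serieant,",
]

# Assumed pattern: optional indentation, a capitalised word, a period, whitespace.
SPEAKER = re.compile(r"^\s*([A-Z][a-z]+)\.\s")

def speaker_of(line):
    """Return the speaker prefix without its period, or None if there is none."""
    match = SPEAKER.match(line)
    return match.group(1) if match else None

speakers = [speaker_of(line) for line in sample_lines]
print(speakers)  # ['King', None, 'Mal']
```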
Slice of the raw data after split()
Names followed by a period are in bold
Macbeth contains 17737 words
Here are some examples: ['Gashes', 'cry', 'for', 'helpe', 'King.', 'So', 'well', 'thy', 'words', 'become', 'thee,', 'as', 'thy', 'wounds,', 'They', 'smack', 'of', 'Honor', 'both:', 'Goe', 'get', 'him', 'Surgeons.', 'Enter', 'Rosse', 'and', 'Angus.', 'Who', 'comes', 'here?', 'Mal.', 'The', 'worthy', 'Thane', 'of', 'Rosse', 'Lenox.', 'What', 'a', 'haste', 'lookes', 'through', 'his', 'eyes?', 'So', 'should', 'he', 'looke,', 'that', 'seemes', 'to', 'speake', 'things', 'strange', 'Rosse.', 'God', 'saue', 'the', 'King', 'King.']
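We know 'Malcolm' is speaking when his name is followed by a period ('Mal.' in bold above), and the same goes for 'Lenox' when he starts speaking ('Lenox.'). Sometimes a character's name is shortened; other times the full name is used, immediately followed by a period.
List of the most common names in Macbeth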
["duncan", "malcolm", "donalbaine", "macbeth", "banquo", "macduff", "lenox", "rosse", "menteth", "angus", "cathnes", "fleance", "seyward", "seyton", "boy", "lady", "messenger", "wife"]
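Goal
- Find all the names, and the abbreviated names where shortened, of the characters in the list above
- Find where a character starts speaking, indicated by a period, and split there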
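One possible way to approach the first goal is prefix matching: treat a speaker tag as an abbreviation if some full name starts with it. This is only a sketch. The expand helper is hypothetical, the names list is a small subset with assumed spellings, and tags such as "King." (used for Duncan) will not match any name this way.

```python
# Small subset of character names; spellings are assumed for illustration.
names = ["duncan", "malcolm", "macbeth", "macduff", "banquo", "lenox", "rosse"]

def expand(abbrev, names):
    """Return every full name that starts with the given speaker tag."""
    key = abbrev.rstrip(".").lower()
    return [name for name in names if name.startswith(key)]

print(expand("Macb.", names))  # ['macbeth']
print(expand("Mal.", names))   # ['malcolm']
print(expand("Mac.", names))   # ['macbeth', 'macduff'] (ambiguous)
print(expand("King.", names))  # [] (a title, not a name prefix)
```

An ambiguous or empty result is a signal that the tag needs a hand-written mapping rather than an automatic one.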
Here is what I have tried so far
Trying to isolate the non-alphanumeric characters
print(len(words_raw))
def extra(string):
    return list(c for c in string if not c.isalnum() and not c.isspace())
weird = extra(macbeth)
weird
discard = []
for char in weird:
    if char != '.':
        discard.append(char)
print(len(weird))
print(len(discard))
print(discard)
revised_macbeth = []
for character in words_raw:
    if not character in discard:
        revised_macbeth.append(character)
print(len(revised_macbeth))
# for character in words_raw:
#     if not character.isalnum():
#         print("found: \'{}\'".format(character))
Its output
17737
4788
3553
['?', ',', ',', '?', '-', "'", ',', "'", ',', '?', ',', '-', ':', ',', ',', ',', ',', ',', ',', ',', '?', ',', ',', ',', "'", ':', ';', ',', ',', ',', ',', ',', ':', '(', ',', ')', "'", ',', ',', "'", ':', "'", ':', '(', ')', ',', ',', "'", '(', ')', "'", ',', "'", ':', "'", ',', ',', "'", "'", ',', "'", ',', "'", ',', ',', ':', ',', "'", ',', ':', ',', ',', ',', "'", ',', "'", ',', ',', ',', ',', ',', "'", ',', '?', ',', ',', ';', ',', ':', ',', '-', "'", ',', ':', ',', ',', ':', ',', ',', ',', ':', '?', '?', ',', "'", ',', '?', ',', ',', ',', ',', ',', ',', ',', ',', "'", ',', ',', '-', ',', ',', "'", ',', ':', ',', ',', ',', ':', ',', ',', ',', ',', ':', ',', ',', ',', '?', ',', '?', ',', ',', '&', ',', ':', ',', ',', ',', '-', "'", ',', "'", "'", ':', ',', ',', ',', ',', "'", ',', ',', ',', "'", "'", '-', ':', '-', ':', ':', "'", ',', ',', ',', ',', ':', ',', '-', ',', ',', ',', ',', ':', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', "'", "'", "'", '?', ',', "'", ',', ',', "'", "'", "'", ',', "'", '?', ',', '?', ',', ':', ',', ':', '?', ',', ',', ',', ',', ',', '?', "'", "'", ',', '?', ',', ',', ',', ':', ',', ',', ',', ',',
Comparison
print(macbeth)
The Tragedie of Macbeth
Actus Primus. Scoena Prima.
Thunder and Lightning. Enter three Witches.
1. When shall we three meet againe?
In Thunder, Lightning, or in Raine?
2. When the Hurley-burley's done,
When the Battaile's lost, and wonne
3. That will be ere the set of Sunne
1. Where the place?
2. Vpon the Heath
3. There to meet with Macbeth
1. I come, Gray-Malkin
print(revised_macbeth)
['The', 'Tragedie', 'of', 'Macbeth', 'Actus', 'Primus.', 'Scoena', 'Prima.', 'Thunder', 'and', 'Lightning.', 'Enter', 'three', 'Witches.', '1.', 'When', 'shall', 'we', 'three', 'meet', 'againe?', 'In', 'Thunder,', 'Lightning,', 'or', 'in', 'Raine?', '2.', 'When', 'the', "Hurley-burley's", 'done,', 'When', 'the', "Battaile's", 'lost,', 'and', 'wonne', '3.', 'That', 'will', 'be', 'ere', 'the', 'set', 'of', 'Sunne', '1.', 'Where', 'the', 'place?', '2.', 'Vpon', 'the', 'Heath', '3.', 'There', 'to', 'meet', 'with', 'Macbeth', '1.', 'I', 'come,', 'Gray-Malkin', 'All.', 'Padock', 'calls', 'anon:', 'faire', 'is', 'foule,', 'and', 'foule', 'is', 'faire,', 'Houer', 'through', 'the', 'fogge', 'and', 'filthie', 'ayre.', 'Exeunt.', 'Scena', 'Secunda.', 'Alarum', 'within.', 'Enter', 'King,', 'Malcome,', 'Donalbaine,', 'Lenox,', 'with', 'attendants,', 'meeting', 'a', 'bleeding', 'Captaine.', 'King.', 'What', 'bloody', 'man', 'is', 'that?', 'he', 'can', 'report,', 'As', 'seemeth', 'by', 'his', 'plight,', 'of', 'the', 'Reuolt', 'The', 'newest', 'state', 'Mal.', 'This', 'is', 'the', 'Serieant,', 'Who', 'like', 'a', 'good', 'and', 'hardie', 'Souldier', 'fought', "'Gainst", 'my', 'Captiuitie:', 'Haile', 'braue', 'friend;', 'Say', 'to', 'the', 'King,', 'the', 'knowledge', 'of', 'the', 'Broyle,', 'As', 'thou', 'didst', 'leaue', 'it', 'Cap.', 'Doubtfull', 'it', 'stood,', 'As', 'two', 'spent', 'Swimmers,', 'that', 'doe', 'cling', 'together,', 'And', 'choake', 'their', 'Art:', 'The', 'me
Following up on my comment above:
You might have an easier time of it if you split into lines first, and then split into words, because I expect the abbreviated character names will always be at the start of a line? Also, I notice the line is indented a couple spaces when a new character starts speaking. That could be another thing to look for.
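The idea in that comment can be sketched on a few lines before applying it to the whole play. The sample string below is hypothetical (it mimics the two-space indent and \r\n line endings mentioned above), and the rule "indented line whose first word ends in a period" is an assumption about the layout, not something verified against the full text.

```python
# Hypothetical sample mimicking the layout described in the comment.
sample = (
    "  King. What bloody man is that? he can report,\r\n"
    "As seemeth by his plight, of the Reuolt\r\n"
    "The newest state\r\n"
    "\r\n"
    "  Mal. This is the Serieant,\r\n"
    "Who like a good and hardie Souldier fought\r\n"
)

speeches = []  # each entry is [speaker, [lines spoken]]
for line in sample.splitlines():
    if not line.strip():
        continue  # skip blank lines
    first, _, rest = line.strip().partition(" ")
    if line.startswith(" ") and first.endswith("."):
        # Indented line whose first word ends in a period: a new speaker starts here
        speeches.append([first.rstrip("."), [rest]])
    elif speeches:
        # Otherwise the line continues the current speech
        speeches[-1][1].append(line.strip())

for speaker, lines in speeches:
    print(speaker, len(lines))
# King 3
# Mal 2
```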
Split into lines:
macbeth_lines = macbeth.split('\r\n') # Because in your text lines are separated by \r\n
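Then, iterate over each line. If it starts with a space, strip everything but periods from the first word, and strip all punctuation from the other words. If it does not start with a space, strip all punctuation from all words. To remove the characters, we'll use str.translate() (docs), which takes a dict mapping each input character's code point to its replacement. We can build this dict so that every punctuation character maps to None, which deletes it.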
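To see what such a table does on single words, and how it differs from the str.strip call used in the question, here is a small check. The variable names delete_all and keep_period are chosen for this example only.

```python
import string

# Map every punctuation code point to None, which deletes that character.
delete_all = {ord(ch): None for ch in string.punctuation}
# Same table, but leave the period alone.
keep_period = {k: v for k, v in delete_all.items() if k != ord(".")}

print("Hurley-burley's".translate(delete_all))      # Hurleyburleys
print("Mal.".translate(keep_period))                # Mal.
print("'Gainst".translate(keep_period))             # Gainst
# str.strip only trims the ends, so inner punctuation survives:
print("friend;".strip(string.punctuation))          # friend
print("Hurley-burley's".strip(string.punctuation))  # Hurley-burley's
```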
import string

# Create a dictionary for str.translate: each punctuation character maps to None (deleted)
strip_chars = {ord(punct): None for punct in string.punctuation}
# And one without the period
strip_chars_no_period = {k: v for k, v in strip_chars.items() if k != ord('.')}
macbeth_words = []
for line in macbeth_lines:
    line_words = line.split()
    line_proc_words = []  # List to see each line as it's processed
                          # Remove if not needed
    if line.startswith(" ") and line_words:
        # This line starts with a space and is not blank. Maybe it contains a name
        # Don't strip periods from the first word
        first_word = line_words[0].translate(strip_chars_no_period)
        line_proc_words.append(first_word)  # Debug line
        # Save the word
        macbeth_words.append(first_word)
        # Remaining words yet to be processed in this line
        remaining_words = line_words[1:]
    else:
        # All words in the line are yet to be processed
        remaining_words = line_words
    # Process remaining words
    for other_word in remaining_words:
        # Strip punctuation
        stripped_word = other_word.translate(strip_chars)
        line_proc_words.append(stripped_word)  # Debug line
        # Save to list
        macbeth_words.append(stripped_word)
    # Print out the line just to make sure it's correct
    print(' '.join(line_proc_words))  # Debug line
I added a line_proc_words list so that we can print each line as it is processed. The output of the code above (I ran it on only the first 100 lines) looks like this:
The Tragedie of Macbeth
Actus Primus Scoena Prima
Thunder and Lightning Enter three Witches
1. When shall we three meet againe
In Thunder Lightning or in Raine
2. When the Hurleyburleys done
When the Battailes lost and wonne
3. That will be ere the set of Sunne
1. Where the place
2. Vpon the Heath
3. There to meet with Macbeth
1. I come GrayMalkin
All. Padock calls anon faire is foule and foule is faire
Houer through the fogge and filthie ayre
Exeunt
Scena Secunda
Alarum within Enter King Malcome Donalbaine Lenox with
attendants meeting a bleeding Captaine
King. What bloody man is that he can report
As seemeth by his plight of the Reuolt
The newest state
Mal. This is the Serieant
Who like a good and hardie Souldier fought
Gainst my Captiuitie Haile braue friend
Say to the King the knowledge of the Broyle
As thou didst leaue it
Cap. Doubtfull it stood
As two spent Swimmers that doe cling together
And choake their Art The mercilesse Macdonwald
Worthie to be a Rebell for to that
The multiplying Villanies of Nature
Doe swarme vpon him from the Westerne Isles
Of Kernes and Gallowgrosses is supplyd
And Fortune on his damned Quarry smiling
Shewd like a Rebells Whore but alls too weake
For braue Macbeth well hee deserues that Name
Disdayning Fortune with his brandisht Steele
Which smoakd with bloody execution
Like Valours Minion carud out his passage
Till hee facd the Slaue
Which neur shooke hands nor bad farwell to him
Till he vnseamd him from the Naue toth Chops
And fixd his Head vpon our Battlements
King. O valiant Cousin worthy Gentleman
Cap. As whence the Sunne gins his reflection
Shipwracking Stormes and direfull Thunders
So from that Spring whence comfort seemd to come
Discomfort swells Marke King of Scotland marke
No sooner Iustice had with Valour armd
Compelld these skipping Kernes to trust their heeles
But the Norweyan Lord surueying vantage
With furbusht Armes and new supplyes of men
Began a fresh assault
King. Dismayd not this our Captaines Macbeth and
Banquoh
Cap. Yes as Sparrowes Eagles
Or the Hare the Lyon
If I say sooth I must report they were
As Cannons ouerchargd with double Cracks
So they doubly redoubled stroakes vpon the Foe
Except they meant to bathe in reeking Wounds
Or memorize another Golgotha
I cannot tell but I am faint
My Gashes cry for helpe
King. So well thy words become thee as thy wounds
They smack of Honor both Goe get him Surgeons
Enter Rosse and Angus
Who comes here
Mal. The worthy Thane of Rosse
Lenox. What a haste lookes through his eyes
So should he looke that seemes to speake things strange
Rosse. God saue the King
King. Whence camst thou worthy Thane
Rosse. From Fiffe great King
Where the Norweyan Banners flowt the Skie
And fanne our people cold
Norway himselfe with terrible numbers
You can use collections.defaultdict to group the lines by speaker name. enumerate can be used to get the line number of the text for each occurrence of a character:
# Requires Python 3.8+ (uses the := assignment expression)
import requests, re
from collections import defaultdict
r = requests.get('https://www.gutenberg.org/cache/epub/2264/pg2264.txt').text
d, l, keywords = defaultdict(list), None, ['Enter', 'Exit', 'Flourish', 'Thunder']
#iterate over the play lines, ignoring empty strings (generated from the split)
for i, a in filter(lambda x:x[-1], enumerate(re.split(r'[\n\r]+', r[r.index('Actus Primus. Scoena Prima.')+27:]))):
    #check that the line contains character dialog, not stage prompts
    if not re.findall('|'.join(keywords), a):
        #grab the name of the character and append to "d"
        if (n:=re.findall(r'^\s+[A-Z](?:\.[A-Z])*[a-z]+\.(?=\s\w+)|^[A-Z](?:\.[A-Z])*[a-z\.]+\.(?=\s\w+)', a)):
            d[(l:=re.sub(r'^\s+|\.$', '', n[0]).lower())].append((i, a[len(n[0])+1:].lower()))
        elif l:
            #the line might be a continuation of a larger block of character text
            d[l].append((i, a.lower()))
print(list(d.keys())) #detected characters
print(d['macb'][:10]) #first ten lines spoken by Macbeth
Output:
['all', 'king', 'mal', 'cap', 'lenox', 'rosse', 'macb', 'banquo', 'mac', 'banq', 'ang', 'lady', 'mess', 'la', 'fleance', 'porter', 'macd', 'port', 'exeunt', 'ban', 'donal', 'malc', 'don', 'ross', 'seruant', 'murth', 'lords', 'mur', 'len', 'hec', 'lord', 'appar', 'musicke', 'wife', 'son', 'mes', 'doct', 'ro', 'gent', 'lad', 'ment', 'cath', 'ser', 'sey', 'seyw', 'sold', 'syw', 'y.sey']
[(137, 'so foule and faire a day i haue not seene'), (170, 'stay you imperfect speakers, tell me more:'), (171, 'by sinells death, i know i am thane of glamis,'), (172, 'but how, of cawdor? the thane of cawdor liues'), (173, 'a prosperous gentleman: and to be king,'), (174, 'stands not within the prospect of beleefe,'), (175, 'no more then to be cawdor. say from whence'), (176, 'you owe this strange intelligence, or why'), (177, 'vpon this blasted heath you stop our way'), (178, 'with such prophetique greeting?')]
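The grouping step can also be tried without the network request. The sketch below is a simplified variant, not the pattern used above: the sample lines, the speaker_re regex, and the names by_speaker and current are assumptions for illustration, and it ignores stage directions entirely.

```python
import re
from collections import defaultdict

# Hypothetical sample lines in the same layout as the play text.
sample_lines = [
    "  King. What bloody man is that? he can report,",
    "As seemeth by his plight, of the Reuolt",
    "  Mal. This is the Serieant,",
    "  King. O valiant Cousin, worthy Gentleman",
]

# Assumed simplified pattern: capitalised word + period, then the spoken text.
speaker_re = re.compile(r"^\s*([A-Z][a-z]+)\.\s+(.*)$")

by_speaker = defaultdict(list)  # speaker tag -> list of (line number, text)
current = None
for number, line in enumerate(sample_lines):
    match = speaker_re.match(line)
    if match:
        # A new speaker tag: remember it and store the rest of the line
        current = match.group(1).lower()
        by_speaker[current].append((number, match.group(2).lower()))
    elif current is not None:
        # No tag: the line continues the current speaker's block
        by_speaker[current].append((number, line.lower()))

print(sorted(by_speaker))                    # ['king', 'mal']
print([n for n, _ in by_speaker["king"]])    # [0, 1, 3]
```

Because tags like 'macb' and 'mac' end up as separate keys in the real output, a later step would still need to merge the variants that refer to the same character.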
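Edit: most common words for each character:
To find the most common words for each character, iterate over each character's sentences in d, then iterate again over the str.split result of each sentence. Note that the result of the previous step will contain many stop words. My solution below gives you the option to filter them out: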
from collections import Counter
def common_words(character, filter_stop = False, stop_words = ()):
    if filter_stop:
        stop_words = set(filter(None, requests.get("https://gist.githubusercontent.com/sebleier/554280/raw/7e0e4a1ce04c2bb7bd41089c9821dbcf6d0c786c/NLTK's%2520list%2520of%2520english%2520stopwords").text.split('\n')))
    # keys in d are lower-cased, so look the character up in lower case
    w = [i for _, b in d[character.lower()] for i in re.sub(r'[\:\.\?]+', '', b).split() if i.lower() not in stop_words]
    return Counter(w).most_common(5)
print(common_words('Macb', filter_stop=True))
Output:
[('haue', 39), ('thou', 34), ('thy', 23), ('shall', 21), ('thee', 20)]
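For a quick offline check of the counting logic, the same idea can be run on a few lines with a hand-written stop-word set. Everything here is illustrative: the third sample line and the tiny stop_words set are made up for the example, and this common_words takes the lines directly instead of reading them from d.

```python
import re
from collections import Counter

# (line number, text) pairs in the same shape as the values stored in d.
lines = [
    (0, "so foule and faire a day i haue not seene"),
    (1, "stay you imperfect speakers, tell me more:"),
    (2, "i haue done the deed"),  # made-up extra line for the example
]
# Tiny illustrative stop-word set, not the NLTK list.
stop_words = {"a", "and", "i", "me", "not", "so", "the", "you"}

def common_words(lines, stop_words=frozenset(), top=3):
    """Count words across all lines, skipping stop words."""
    words = []
    for _, text in lines:
        # Drop everything that is neither a word character nor whitespace
        for word in re.sub(r"[^\w\s]", "", text).split():
            if word not in stop_words:
                words.append(word)
    return Counter(words).most_common(top)

print(common_words(lines, stop_words, top=1))  # [('haue', 2)]
```

Since the play uses early modern spellings ('haue', 'thou', 'thy'), a modern stop-word list will leave many function words in place, which is visible in the output above.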