Split sentences, process words, and put the sentence back together?
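I have a function that scores words. I have lots of text, from single sentences to documents several pages long. I'm stuck on how to score the words and return the text close to its original state.
Here is an example sentence:
"My body lies over the ocean, my body lies over the sea."
What I want to produce is:
"My body (2) lies over the ocean (3), my body (2) lies over the sea."
Below is a dummy version of my scoring algorithm. I've figured out how to take the text, tear it apart, and score it.
However, I'm stuck on how to put it back together in the format I need.
Here is a dummy version of my function: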
def word_score(text):
    words_to_work_with = []
    words_to_return = []
    passed_text = TextBlob(text)
    for word in passed_text.words:
        word = word.singularize().lower()
        word = str(word)
        e_word_lemma = lemmatizer.lemmatize(word)
        words_to_work_with.append(e_word_lemma)
    for word in words_to_work_with:
        if word == 'body':
            score = 2
        elif word == 'ocean':
            score = 3
        else:
            score = None
        words_to_return.append((word, score))
    return words_to_return
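I'm a relative newbie, so I have two questions:
- How can I put the text back together, and
- Should that logic go inside the function or outside it?
I'd really like to be able to feed whole pieces of text (i.e. sentences, documents) into the function and have it return them.
Thank you for helping me!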
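Basically, you want to assign a score to each word. The function you provided can be improved by using a dictionary instead of several if statements.
Also, you have to return all the scores, not just the score of the first word in words_to_work_with, which is the function's current behavior, since it returns an integer on the first iteration.
So the new function would be: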
def word_score(text):
    words_to_work_with = []
    passed_text = TextBlob(text)
    for word in passed_text.words:
        word = word.singularize().lower()
        word = str(word)  # Is this line really useful?
        e_word_lemma = lemmatizer.lemmatize(word)
        words_to_work_with.append(e_word_lemma)
    dict_scores = {'body': 2, 'ocean': 3}  # etc ...
    # if a word is not recognized, its score is None
    return [dict_scores.get(word, None) for word in words_to_work_with]
For the second part, rebuilding the string, I would actually do it in the same function (so this answers your second question):
def word_score_and_reconstruct(text):
    words_to_work_with = []
    passed_text = TextBlob(text)
    reconstructed_text = ''
    for word in passed_text.words:
        word = word.singularize().lower()
        word = str(word)  # Is this line really useful?
        e_word_lemma = lemmatizer.lemmatize(word)
        words_to_work_with.append(e_word_lemma)
    dict_scores = {'body': 2, 'ocean': 3}
    dict_strings = {'body': ' (2)', 'ocean': ' (3)'}
    word_scores = []
    for word in words_to_work_with:
        word_scores.append(dict_scores.get(word, None))  # we still construct the scores list here
        # we add 'word' + '(word's score)' only if the word has a score;
        # if not, we add the default value '', meaning we don't add anything
        # note: the words are concatenated as-is, without spaces or punctuation
        reconstructed_text += word + dict_strings.get(word, '')
    return reconstructed_text, word_scores
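I can't guarantee this code will work on the first try, since I couldn't test it, but it should give you the main idea.
Hope this helps. Based on your question, it worked for me.
Best regards!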
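Since the answer above was posted untested, here is a minimal sketch of the same idea (score and rebuild inside one function) that runs with the standard library only. It rests on assumptions that are not in the original: regex tokenization replaces TextBlob, plain lowercasing stands in for singularizing and lemmatizing, and the names `SCORES` and `score_and_reconstruct` are made up for illustration.

```python
import re

# Hypothetical score table, taken from the example sentence in the question.
SCORES = {"body": 2, "ocean": 3}


def score_and_reconstruct(text, scores=SCORES):
    """Return (annotated_text, word_scores), keeping spacing and punctuation."""
    # The capturing group makes re.split keep the words as well as the
    # separators (spaces, commas, ...), so nothing from the input is lost.
    pieces = re.split(r"(\w+)", text)
    word_scores = []
    rebuilt = []
    for piece in pieces:
        if re.fullmatch(r"\w+", piece):
            # Assumption: lowercasing stands in for real lemmatization.
            score = scores.get(piece.lower())
            word_scores.append((piece, score))
            if score is not None:
                piece = f"{piece} ({score})"
        rebuilt.append(piece)
    return "".join(rebuilt), word_scores
```

Because the separators are kept in the list, joining the pieces gives back the original text with only the scores added, and the per-word scores are still available as a list.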
"""
Python 3.7.2
Input:
Saved text in the file named as "original_text.txt"
My body lies over the ocean, my body lies over the sea.
"""
input_file = open('original_text.txt', 'r')  # reading text from file
output_file = open('processed_text.txt', 'w')  # saving output text in file
output_text = []
for line in input_file:
    words = line.split()
    for word in words:
        if word == 'body':
            output_text.append('body (2)')
            output_file.write('body (2) ')
        elif word == 'body,':
            output_text.append('body (2),')
            output_file.write('body (2), ')
        elif word == 'ocean':
            output_text.append('ocean (3)')
            output_file.write('ocean (3) ')
        elif word == 'ocean,':
            output_text.append('ocean (3),')
            output_file.write('ocean (3), ')
        else:
            output_text.append(word)
            output_file.write(word + ' ')
print(output_text)
input_file.close()
output_file.close()
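Here is a working implementation. The function first parses the input text into a list, so that each list element is either a word or a run of punctuation characters (for example, a comma followed by a space). After processing the words in the list, it joins the list back into a string and returns it.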
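import re

import inflection
from nltk.stem import WordNetLemmatizer  # needs the NLTK WordNet data

lemmatizer = WordNetLemmatizer()

def word_score(text):
    # each element is either a word or a run of non-word characters
    words_to_work_with = re.findall(r"\b\w+|\b\W+", text)
    for i, word in enumerate(words_to_work_with):
        if word.isalpha():
            words_to_work_with[i] = inflection.singularize(word).lower()
            # note: this overwrites the line above, because it lemmatizes the original word
            words_to_work_with[i] = lemmatizer.lemmatize(word)
            if word == 'body':
                words_to_work_with[i] = 'body (2)'
            elif word == 'ocean':
                words_to_work_with[i] = 'ocean (3)'
    return ''.join(words_to_work_with)

txt = "My body lies over the ocean, my body lies over the sea."
output = word_score(txt)
print(output)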
Output:
My body (2) lie over the ocean (3), my body (2) lie over the sea.
If you have more than 2 words to score, using a dictionary instead of if conditions is indeed a good idea.
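As a rough illustration of the dictionary suggestion, the sketch below looks each word up in a score table through an `re.sub` callback, so the original wording, spacing and punctuation stay untouched. The function name `annotate`, the `normalize` parameter and the `'sea': 1` entry are assumptions added for the example. They do not come from the answers above.

```python
import re


def annotate(text, scores, normalize=str.lower):
    """Append ' (score)' to every word whose normalized form is in `scores`."""
    def add_score(match):
        word = match.group(0)
        # `normalize` is a hypothetical hook: a lemmatizer-based function
        # could be passed here instead of plain lowercasing.
        score = scores.get(normalize(word))
        return word if score is None else f"{word} ({score})"

    # Only the matched words are rewritten; everything between them is kept.
    return re.sub(r"\w+", add_score, text)


# 'sea': 1 is a made-up entry showing that a new word needs no new branch.
scores = {"body": 2, "ocean": 3, "sea": 1}
print(annotate("My body lies over the ocean, my body lies over the sea.", scores))
# prints: My body (2) lies over the ocean (3), my body (2) lies over the sea (1).
```

Adding another scored word only means adding another dictionary entry, which is the main advantage over a growing if/elif chain.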