Split sentences, process words, and put sentence back together?

Question

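I have a function that scores words. I have lots of text, ranging from single sentences to documents several pages long. I'm stuck on how to score the words and return the text close to its original state.

Here is an example sentence: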

"My body lies over the ocean, my body lies over the sea."

What I want to produce is:

"My body (2) lies over the ocean (3), my body (2) lies over the sea."

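Below is a dummy version of my scoring algorithm. I've figured out how to take the text, tear it apart, and score it.

However, I'm stuck on how to put it back together in the format I need.

Here is the dummy version of my function: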

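from textblob import TextBlob
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()  # assumed: NLTK's WordNet lemmatizer

def word_score(text):
    words_to_work_with = []
    words_to_return = []
    passed_text = TextBlob(text)
    # Normalize each word: singular form, lowercase, then lemma.
    for word in passed_text.words:
        word = word.singularize().lower()
        word = str(word)
        e_word_lemma = lemmatizer.lemmatize(word)
        words_to_work_with.append(e_word_lemma)
    # Pair each normalized word with its score (None if it has no score).
    for word in words_to_work_with:
        if word == 'body':
            score = 2
        elif word == 'ocean':
            score = 3
        else:
            score = None
        words_to_return.append((word, score))
    return words_to_return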

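I'm a relative newbie, so I have two questions:

How can I put the text back together, and
should that logic go inside the function or outside it?

I'd really like to be able to feed whole pieces of text (i.e. sentences, documents) into the function and have it return them.

Thanks for helping me!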

Answer 1

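Basically, you want to assign a score to each word. The function you provided can be improved by using a dictionary instead of several if statements. Also, you have to return all the scores, not just the score of the first word in words_to_work_with, which is the function's current behavior, since it will return an integer on the first iteration. So the new function would be: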

def word_score(text):
    words_to_work_with = []
    passed_text = TextBlob(text)
    for word in passed_text.words:
        word = word.singularize().lower()
        word = str(word) # Is this line really useful ?
        e_word_lemma = lemmatizer.lemmatize(word)
        words_to_work_with.append(e_word_lemma)

    dict_scores = {'body': 2, 'ocean': 3}  # etc ...
    # if a word is not recognized, its score is None
    return [dict_scores.get(word, None) for word in words_to_work_with]

For the second part, which is rebuilding the string, I would actually do it in the same function (so this answers your second question):

def word_score_and_reconstruct(text):
    words_to_work_with = []
    passed_text = TextBlob(text)

    for word in passed_text.words:  # note: .words leaves out punctuation
        word = word.singularize().lower()
        word = str(word)  # Is this line really useful ?
        e_word_lemma = lemmatizer.lemmatize(word)
        words_to_work_with.append(e_word_lemma)

    dict_scores = {'body': 2, 'ocean': 3}
    dict_strings = {'body': ' (2)', 'ocean': ' (3)'}

    word_scores = []
    reconstructed_words = []

    for word in words_to_work_with:
        word_scores.append(dict_scores.get(word, None)) # we still construct the scores list here

        # we add 'word'+'(word's score)', only if the word has a score
        # if not, we add the default value '' meaning we don't add anything
        reconstructed_words.append(word + dict_strings.get(word, ''))

    # join with spaces; punctuation and the original word forms are not restored
    return ' '.join(reconstructed_words), word_scores

I can't guarantee this code will work on the first try, since I couldn't test it, but it should give you the main idea.
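
Because `TextBlob.words` leaves out punctuation, joining the processed words cannot reproduce the original sentence exactly. One possible alternative, sketched here rather than taken from the answer above, is to leave the text in place and let `re.sub` call a function for every word it finds. The names `annotate_text` and `SCORES` are made up for this illustration, and lowercasing is only an assumed stand-in for the singularize/lemmatize step.

```python
import re

# Hypothetical score table, taken from the example in the question.
SCORES = {'body': 2, 'ocean': 3}

def annotate_text(text, scores=SCORES):
    """Return (annotated_text, word_scores); punctuation and spacing are untouched."""
    word_scores = []

    def add_score(match):
        original = match.group(0)
        # Assumption: lowercasing stands in for singularize()/lemmatize().
        key = original.lower()
        score = scores.get(key)
        word_scores.append((key, score))
        # Keep the original spelling; append the score only when there is one.
        return original if score is None else f"{original} ({score})"

    # Only runs of ASCII letters are treated as words in this sketch.
    annotated = re.sub(r"[A-Za-z]+", add_score, text)
    return annotated, word_scores

annotated, word_scores = annotate_text(
    "My body lies over the ocean, my body lies over the sea.")
print(annotated)
# My body (2) lies over the ocean (3), my body (2) lies over the sea.
```

Since only the matched words are replaced, everything between them (commas, spaces, the final period) comes back exactly as it went in, and the list of (word, score) pairs is still available to the caller.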

Answer 2

Hope this helps. Based on your question, this works for me.

Best regards!

"""
Python 3.7.2

Input:
Saved text in the file named as "original_text.txt"
My body lies over the ocean, my body lies over the sea. 
"""
# Read the text from one file and save the processed text in another.
with open('original_text.txt', 'r') as input_file, \
        open('processed_text.txt', 'w') as output_file:
    output_text = []

    for line in input_file:
        words = line.split()
        for word in words:
            if word == 'body':
                output_text.append('body (2)')
                output_file.write('body (2) ')
            elif word == 'body,':
                output_text.append('body (2),')
                output_file.write('body (2), ')
            elif word == 'ocean':
                output_text.append('ocean (3)')
                output_file.write('ocean (3) ')
            elif word == 'ocean,':
                output_text.append('ocean (3),')
                output_file.write('ocean (3), ')
            else:
                output_text.append(word)
                output_file.write(word + ' ')

print(output_text)
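
The separate branches for 'body' and 'body,' grow quickly once other punctuation marks or more words are involved. A minimal sketch of one way around that, assuming it is enough to peel trailing punctuation off each whitespace-separated token before the lookup; `score_token`, `score_line` and `SCORES` are hypothetical names used only here:

```python
import string

# Hypothetical score table, taken from the example in the question.
SCORES = {'body': 2, 'ocean': 3}

def score_token(token, scores=SCORES):
    """Annotate one whitespace-separated token, keeping its trailing punctuation."""
    core = token.rstrip(string.punctuation)   # 'ocean,' -> 'ocean'
    trailing = token[len(core):]              # 'ocean,' -> ','
    score = scores.get(core.lower())
    if score is None:
        return token
    return f"{core} ({score}){trailing}"

def score_line(line, scores=SCORES):
    # split()/join() collapses runs of whitespace into single spaces.
    return ' '.join(score_token(tok, scores) for tok in line.split())

print(score_line("My body lies over the ocean, my body lies over the sea."))
# My body (2) lies over the ocean (3), my body (2) lies over the sea.
```

Leading punctuation such as an opening quote is not handled in this sketch; a token like `"body` would simply pass through unscored.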

Answer 3

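Here is a working implementation. The function first parses the input text into a list, so that each list element is either a word or a run of punctuation characters (e.g. a comma followed by a space). After processing the words in the list, it joins the list back into a string and returns it.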

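import re
import inflection
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()  # assumed: NLTK's WordNet lemmatizer

def word_score(text):
    # Split the text into alternating runs of word and non-word characters.
    words_to_work_with = re.findall(r"\b\w+|\b\W+", text)
    for i, word in enumerate(words_to_work_with):
        if word.isalpha():
            # Note: this result is overwritten by the next line,
            # which lemmatizes the original word.
            words_to_work_with[i] = inflection.singularize(word).lower()
            words_to_work_with[i] = lemmatizer.lemmatize(word)
            if word == 'body':
                words_to_work_with[i] = 'body (2)'
            elif word == 'ocean':
                words_to_work_with[i] = 'ocean (3)'
    return ''.join(words_to_work_with)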

txt = "My body lies over the ocean, my body lies over the sea."
output = word_score(txt)
print(output)

Output:

My body (2) lie over the ocean (3), my body (2) lie over the sea.

If you have more than 2 words to score, using a dictionary instead of if conditions is indeed a good idea.
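
As a rough illustration of that dictionary suggestion, the function above could be reshaped along these lines. This is a sketch under assumptions, not the answer's code: the `scores` and `normalize` parameters are introduced here for illustration, and `str.lower` is only a placeholder for whatever normalization (singularizing, lemmatizing) is wanted.

```python
import re

def word_score(text, scores, normalize=str.lower):
    """Append ' (score)' to every word whose normalized form is in `scores`."""
    # Alternating runs of word and non-word characters cover the whole string,
    # so joining the tokens restores the original text exactly.
    tokens = re.findall(r"\w+|\W+", text)
    for i, token in enumerate(tokens):
        if token.isalpha():
            score = scores.get(normalize(token))
            if score is not None:
                # The original spelling is kept; only the lookup is normalized.
                tokens[i] = f"{token} ({score})"
    return ''.join(tokens)

scores = {'body': 2, 'ocean': 3}
print(word_score("My body lies over the ocean, my body lies over the sea.", scores))
# My body (2) lies over the ocean (3), my body (2) lies over the sea.
```

Passing something like `lambda w: lemmatizer.lemmatize(w.lower())` as `normalize` would bring the lemmatizer back in for the lookup while still leaving words such as "lies" unchanged in the output.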

Tags: python, text, split, nltk, sentence