Problem with my for-loop combined with yield

Question

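I have a program that joins words separated by an asterisk. The program removes the asterisk and joins the first part of the word (before the asterisk) to its second part (after the asterisk). It works well except for one major problem: the second part (after the asterisk) is still in the output. For example, the program joins ['presi', '*', 'dent'], but 'dent' is still in the output. I have not been able to figure out what is wrong with my code. Here it is: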

from collections import defaultdict
import nltk
from nltk.tokenize import word_tokenize
import re
import os
import sys
from pathlib import Path


def main():
    while True:
        try:
            file_to_open =Path(input("\nPlease, insert your file path: "))

            with open(file_to_open) as f:
                words = word_tokenize(f.read().lower())
                break
        except FileNotFoundError:
            print("\nFile not found. Better try again")
        except IsADirectoryError:
            print("\nIncorrect Directory path.Try again")

    word_separator = '*'

    with open ('Fr-dictionary2.txt') as fr:
            dic = word_tokenize(fr.read().lower())

    def join_asterisk(ary):

        for w1, w2, w3 in zip(words, words[1:], words[2:]):
            if w2 == word_separator:
                word = w1 + w3
                yield (word, word in dic)
            elif w1 != word_separator and w1 in dic:
                yield (w1, True)


    correct_words = []
    incorrect_words = []
    correct_words = [w for w, correct in join_asterisk(words) if correct]
    incorrect_words = [w for w, correct in join_asterisk(words) if not correct]
    text=' '.join(correct_words)
    print(correct_words)
    print('\n\n', text)
    user2=input('\nWrite text to a file? Type "Y" for yes or "N" for no:')

    text_name=input("name your file.(Ex. 'my_first_file.txt'): ")
    out_file=open(text_name,"w")

    if user2 =='Y':
        out_file.write(text)
        out_file.close()
    else:
        print('ok')


main()

Can anyone help me find the error here?

Sample input:

Les engage * ments du prési * dent de la Républi * que sont aussi ceux des dirigeants de la société » ferroviaire, a-t-il soutenu de vant des élus du Grand-Est réunis à l’Elysée.

Le président de la République, Emmanuel Macron (à droite), aux cô * tés du patron de la SNCF, Guillaume Pepy, à la gare Montparnasse, à Paris, le 1er juillet 2017. GEOFFROY VAN DER HASSELT / AFP

L’irrita tion qui, par fois, s’empare des usa * gers de la SNCF face aux trains suppri * més ou aux dessertes abandonnées semble avoir aussi saisi le président de la République. Devant des élus du Grand-Est, réunis mardi 26 février à l’Elysée dans le cadre du grand débat, Emmanuel Macron a eu des mots très durs contre la SNCF, qui a fermé la ligne Saint-Dié - Epinal le 23 décembre 2018, alors que le chef de l’Etat s’était engagé, durant un dépla * cement dans les Vosges effec * tué en avril 2018, à ce qu’elle reste opération * nelle.

Sample of my current output:

['les', 'engagements', 'du', 'président', 'dent', 'de', 'la', 'république', 'que', 'sont', 'aussi', 'ceux', 'des', 'dirigeants', 'de', 'la', 'société', 'ferroviaire']

Sample of my desired output:

['les', 'engagements', 'du', 'président', 'de', 'la', 'république', 'sont', 'aussi', 'ceux', 'des', 'dirigeants', 'de', 'la', 'société', 'ferroviaire']

Answer 1

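Both extra words are (I assume) in your dictionary, so each one is yielded a second time two iterations later in the for loop, when it has become w1 and matches this branch: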

            elif w1 != word_separator and w1 in dic:
                yield (w1, True)
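To make the cause concrete, here is a minimal sketch that reproduces the behavior with the same sliding-window loop as in the question. The small `dic` set and the short `words` list are made-up stand-ins for illustration, not the asker's actual dictionary file or input, and the name `join_asterisk_original` is hypothetical.

```python
# Hypothetical stand-ins for the asker's dictionary and tokenized text.
dic = {'du', 'président', 'dent', 'de', 'la'}
words = ['du', 'prési', '*', 'dent', 'de', 'la', 'république']
word_separator = '*'


def join_asterisk_original(words):
    # Same logic as the question: a window of three consecutive tokens.
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        if w2 == word_separator:
            word = w1 + w3
            yield (word, word in dic)
        elif w1 != word_separator and w1 in dic:
            # 'dent' lands here two iterations after it was joined as w3.
            yield (w1, True)


result = [w for w, ok in join_asterisk_original(words) if ok]
print(result)
# ['du', 'président', 'dent', 'de']
```

Note that 'la' is missing from the result as well: zip stops at the shortest slice, so the last two tokens never appear as w1.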

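Redesigning your join_asterisk function seems to be the best way to fix this, since any attempt to patch the existing function so that it skips these words would be very hacky.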

Here is one way to redesign the function so that it skips words that have already been used as the second half of a word split by '*':

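incorrect_words = []
def join_asterisk(array):
    # Pad with two empty strings so ary[i+1] and ary[i+2] always exist.
    ary = array + ['', '']
    i, size = 0, len(ary)
    while i < size - 2:
        if ary[i+1] == word_separator:
            # Join the two halves around the separator.
            if ary[i] + ary[i+2] in dic:
                yield ary[i] + ary[i+2]
            else:
                incorrect_words.append(ary[i] + ary[i+2])
            # Together with the i+=1 below, skip the separator and the second half.
            i+=2
        elif ary[i] in dic:
            yield ary[i]
        i+=1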

If you want it to stay closer to your original function, it can be modified to:

def join_asterisk(array):
    ary = array + ['', '']
    i, size = 0, len(ary)
    while i < size - 2:
        if ary[i+1] == word_separator:
            concat_word = ary[i] + ary[i+2]
            yield (concat_word, concat_word in dic)
            i+=2
        else: 
            yield (ary[i], ary[i] in dic)
        i+=1
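As a usage sketch, the tuple-yielding version can be made self-contained by passing the dictionary and separator in as parameters instead of relying on names from the enclosing scope. The parameter names, the tiny `dic` set and the sample `words` list below are assumptions for illustration only.

```python
def join_asterisk(array, dic, word_separator='*'):
    # Pad with two empty strings so ary[i + 1] and ary[i + 2] always exist.
    ary = list(array) + ['', '']
    i, size = 0, len(ary)
    while i < size - 2:
        if ary[i + 1] == word_separator:
            concat_word = ary[i] + ary[i + 2]
            yield (concat_word, concat_word in dic)
            i += 2  # skip the separator and the second half
        else:
            yield (ary[i], ary[i] in dic)
        i += 1


# Hypothetical sample data, not the asker's real dictionary.
dic = {'du', 'président', 'de', 'la'}
words = ['du', 'prési', '*', 'dent', 'de', 'la', 'républi', '*', 'que']

# Run the generator once and split the result, instead of iterating twice.
pairs = list(join_asterisk(words, dic))
correct_words = [w for w, ok in pairs if ok]
incorrect_words = [w for w, ok in pairs if not ok]
print(correct_words)    # ['du', 'président', 'de', 'la']
print(incorrect_words)  # ['république']
```

Materializing `pairs` once means the input is scanned a single time, whereas the code in the question calls the generator twice, once for each list.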

Answer 2

I think this alternative implementation of join_asterisk does what you intend:

def join_asterisk(words, word_separator):
    if not words:
        return
    # Whether the previous word was a separator
    prev_sep = (words[0] == word_separator)
    # Next word to yield
    current = words[0] if not prev_sep else ''
    # Iterate words
    for word in words[1:]:
        # Skip separator
        if word == word_separator:
            prev_sep = True
        else:
            # If neither this word nor the previous one was a separator
            if not prev_sep:
                # Yield current word and clear
                yield current
                current = ''
            # Add word to current
            current += word
            prev_sep = False
    # Yield last word if list did not finish with a separator
    if not prev_sep:
        yield current

words = ['les', 'engagements', 'du', 'prési', '*', 'dent', 'de', 'la', 'républi', '*', 'que', 'sont', 'aussi', 'ceux', 'des', 'dirigeants', 'de', 'la', 'société', 'ferroviaire']
word_separator = '*'
print(list(join_asterisk(words, word_separator)))
# ['les', 'engagements', 'du', 'président', 'de', 'la', 'république', 'sont', 'aussi', 'ceux', 'des', 'dirigeants', 'de', 'la', 'société', 'ferroviaire']
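The generator above yields plain joined words without the dictionary check from the question. One possible way to recover the correct/incorrect split is to partition its output afterwards. This is only a sketch: `partition_by_dictionary` is a hypothetical helper, and the small `dictionary` set stands in for the contents of the asker's dictionary file. A set is used because membership tests on a set are much faster than on the token list the question builds.

```python
def partition_by_dictionary(joined_words, dictionary):
    """Split words into (correct, incorrect) lists in a single pass."""
    correct, incorrect = [], []
    for word in joined_words:
        if word in dictionary:
            correct.append(word)
        else:
            incorrect.append(word)
    return correct, incorrect


# Hypothetical dictionary; in the question it would come from 'Fr-dictionary2.txt'.
dictionary = frozenset(['les', 'engagements', 'du', 'président', 'de', 'la'])
# Stand-in for list(join_asterisk(words, word_separator)).
joined = ['les', 'engagements', 'du', 'président', 'de', 'la', 'république']

correct_words, incorrect_words = partition_by_dictionary(joined, dictionary)
print(correct_words)    # ['les', 'engagements', 'du', 'président', 'de', 'la']
print(incorrect_words)  # ['république']
```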
