Data generation in Python

Question

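I'm trying to generate a dataset from an existing one. I was able to implement a method that randomly changes the contents of the files, but I can't write the result back to the files. I also need to write the number of changed words to each file, because I want to use this dataset to train a neural network. Can you help me?

Input: each file contains 2 lines of text.

Output: a file with (possibly) 3 lines: the first line unchanged, the second line changed by the method, and the third line showing the number of changed words. If this is not the best approach for a deep learning task, I would be glad to get advice, as I'm a beginner.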


<pre>from random import randrange
import os

Path = "D:\\corrected data\\"  # backslashes escaped so the string literal is valid
filelist = os.listdir(Path)

if __name__ == "__main__":
    new_words = ['consultable', 'partie ', 'celle ', 'également ', 'forte ', 'statistiques ', 'langue ', 
'cadeaux', 'publications ', 'notre', 'nous', 'pour', 'suivr', 'les', 'vos', 'visitez ', 'thème ', 'thème  ', 'thème ', 'produits', 'coulisses ', 'un ', 'atelier ', 'concevoir  ', 'personnalisés  ', 'consultable', 'découvrir ', 'fournit ', 'trace ', 'dire ', 'tableau', 'décrire', 'grande ', 'feuille ', 'noter ', 'correspondant', 'propre',]
    nb_words_to_replace = randrange(10)

    #with open("1.txt") as file:
    for i in filelist:
       # if i.endswith(".txt"):  
            with open(Path + i,"r",encoding="utf-8") as file:
               # for line in file:
                    data = file.readlines()
                    first_line = data[0]
                    second_line = data[1]
                    print(f"Original: {second_line}")
                   # print(f"FIle: {file}")
                    second_line_array = second_line.split(" ")
                    for j in range(nb_words_to_replace):
                        replacement_position = randrange(len(second_line_array))

                        old_word = second_line_array[replacement_position]
                        new_word = new_words[randrange(len(new_words))]
                        print(f"Position {replacement_position} : {old_word} -> {new_word}")

                        second_line_array[replacement_position] = new_word

                    res = " ".join(second_line_array)
                    print(f"Result: {res}")
            with open(Path + i,"w") as f:
                       for line in file:
                          if line == second_line:
                                f.write(res)
</pre>

Answer 1

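In short, you have two questions:

How to correctly replace the 2nd (and 3rd) line of the file.
How to keep track of the number of changed words.

How to correctly replace the 2nd (and 3rd) line of the file.

Your code: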

with open(Path + i,"w") as f:
   for line in file:
      if line == second_line:
         f.write(res)

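Reading is not enabled here: the file is opened in "w" mode, which is write-only and empties the file as soon as it is opened. On top of that, for line in file will not work, because the handle defined in this block is f, while file refers to the handle from the earlier with block, which is already closed. To fix this, do the following instead: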

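with open(Path + i, "r+", encoding="utf-8") as file:
   lines = file.read().splitlines()    # splitlines() removes the \n characters
   lines[1] = res.rstrip("\n")         # res is the modified second line
   file.seek(0)                        # read() left the position at the end of the file
   file.write("\n".join(lines) + "\n") # writelines() would not add the \n characters back
   file.truncate()                     # drop leftovers if the new content is shorter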
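To illustrate the in-place approach, here is a minimal sketch that runs on a throwaway file. The helper name replace_line_in_place and the demo file are made up for illustration, and it assumes the file is small enough to read in one go. After read(), the file position sits at the end of the file, so the sketch seeks back to the start before writing and truncates afterwards in case the new content is shorter than the old one.

```python
import os
import tempfile


def replace_line_in_place(path, line_index, new_text, encoding="utf-8"):
    """Replace one line of a text file and keep the other lines unchanged."""
    with open(path, "r+", encoding=encoding) as file:
        lines = file.read().splitlines()     # newline characters are dropped here
        lines[line_index] = new_text
        file.seek(0)                         # go back to the start before writing
        file.write("\n".join(lines) + "\n")  # put the newline characters back
        file.truncate()                      # cut off leftovers from the old content


# Demo on a temporary file (hypothetical content)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.txt")
    with open(demo_path, "w", encoding="utf-8") as file:
        file.write("first line\na much longer second line\n")

    replace_line_in_place(demo_path, 1, "short")

    with open(demo_path, encoding="utf-8") as file:
        demo_result = file.read()

print(repr(demo_result))
```

Without the seek(0) call the new text would be appended after the old content, and without truncate() the tail of the longer old line would remain in the file.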

However, you want to add more lines to the file. I suggest structuring the logic differently.

How to keep track of the number of changed words.

Add a variable changed_words_count and increment it when old_word != new_word.
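A minimal sketch of the counting idea, pulled out into a function so it can be tried on its own. The name replace_random_words, the sample sentence and the small vocabulary are assumptions made for illustration. Note one deliberate variation: rather than incrementing a counter on every replacement, the sketch compares the final words with the originals once at the end, so a position that happens to be picked twice is counted only once.

```python
import random


def replace_random_words(words, vocabulary, n_replacements, rng=random):
    """Return (result, changed_count) without modifying `words`."""
    result = list(words)
    if not result:
        return result, 0
    for _ in range(n_replacements):
        position = rng.randrange(len(result))
        # strip() guards against vocabulary entries with stray spaces
        result[position] = rng.choice(vocabulary).strip()
    # Compare with the original once, at the end, so a position that was
    # picked twice (or replaced by an identical word) is not over-counted.
    changed_count = sum(old != new for old, new in zip(words, result))
    return result, changed_count


rng = random.Random(0)  # fixed seed so the demo is repeatable
original = "le chat dort sur le canapé".split()
modified, changed = replace_random_words(original, ["pour ", "nous", "le"], 4, rng)
print(" ".join(modified), changed)
```

Whether the label should be the number of replacements attempted or the number of positions that actually differ depends on what the network is supposed to learn, so treat this as one possible definition.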

Resulting code:

for i in filelist:
    filepath = Path + i

    with open(filepath, "r", encoding="utf-8") as file:
        data = file.readlines()

    first_line = data[0].rstrip("\n")
    # split() without arguments also drops the trailing newline
    second_line_array = data[1].split()

    changed_words_count = 0
    for j in range(nb_words_to_replace):
        replacement_position = randrange(len(second_line_array))

        old_word = second_line_array[replacement_position]
        # Some entries in new_words carry trailing spaces, so strip them
        new_word = new_words[randrange(len(new_words))].strip()

        # A word replaced does not mean the word has changed.
        # It could be replacing itself.
        # Check if the replacing word is different.
        # Note: a position that is picked twice is counted twice.
        if old_word != new_word:
            changed_words_count += 1

        second_line_array[replacement_position] = new_word

    # The lines that will be replacing the file
    new_lines = [first_line, " ".join(second_line_array), str(changed_words_count)]

    print(f"Result: {new_lines[1]}")

    # writelines() does not add newline characters, so join the lines explicitly
    with open(filepath, "w", encoding="utf-8") as file:
        file.write("\n".join(new_lines) + "\n")

Note: the code is untested.
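One way to try this kind of code without touching the real data is to run it on a throwaway directory. The sketch below is only an illustration under some assumptions: the function name augment_file, the sample text and the three-word vocabulary are made up, the number of replacements is drawn per file, and the result is written to a separate output folder so the source files stay intact.

```python
import pathlib
import random
import tempfile


def augment_file(src, dst, vocabulary, n_replacements, rng):
    """Read a 2-line file and write a 3-line file: original, modified, count."""
    first_line, second_line = src.read_text(encoding="utf-8").splitlines()[:2]
    words = second_line.split()
    modified = list(words)
    for _ in range(n_replacements):
        if not modified:  # nothing to replace on an empty line
            break
        modified[rng.randrange(len(modified))] = rng.choice(vocabulary).strip()
    # Count the positions whose word differs from the original
    changed = sum(old != new for old, new in zip(words, modified))
    dst.write_text(
        "\n".join([first_line, " ".join(modified), str(changed)]) + "\n",
        encoding="utf-8",
    )


with tempfile.TemporaryDirectory() as tmp:
    in_dir = pathlib.Path(tmp, "in")
    out_dir = pathlib.Path(tmp, "out")
    in_dir.mkdir()
    out_dir.mkdir()
    (in_dir / "1.txt").write_text(
        "bonjour le monde\nbonjour le monde\n", encoding="utf-8"
    )

    rng = random.Random(42)  # fixed seed so the run is repeatable
    for src in sorted(in_dir.glob("*.txt")):
        augment_file(src, out_dir / src.name, ["notre", "nous", "pour"],
                     rng.randrange(10), rng)

    output_lines = (out_dir / "1.txt").read_text(encoding="utf-8").splitlines()
    source_text = (in_dir / "1.txt").read_text(encoding="utf-8")

print(output_lines)
```

Keeping the originals untouched also makes it possible to generate several different variants from the same source file.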
