Python | Reformatting each line in a text file consistently

Question

I have built my own corpus of misspelled words.

misspellings_corpus.txt:

English, enlist->Enlish
Hallowe'en, Halloween->Hallowean

My format is wrong. Thankfully, it is at least consistent.

Current format:

correct, wrong1, wrong2->wrong3

Required format:

wrong1,wrong2,wrong3->correct

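The order of the wrong<N> words does not matter,
each line may contain any number of wrong<N> words (separated by commas: ,),
and each line has exactly 1 correct word (which should be to the right of the ->).

Failed attempt: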

with open('misspellings_corpus.txt') as oldfile, open('new.txt', 'w') as newfile:
    for line in oldfile:
      correct = line.split(', ')[0].strip()
      print(correct)
      W = line.split(', ')[1].strip()
      print(W)
      wrong_1 = W.split('->')[0] # however, there might be loads of wrong words
      wrong_2 = W.split('->')[1]
      newfile.write(wrong_1 + ', ' + wrong_2 + '->' + correct)

Output new.txt (not working):

enlist, Enlish->EnglishHalloween, Hallowean->Hallowe'en

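Solution (inspired by @alexis):

import re

with open('misspellings_corpus.txt') as oldfile, open('new.txt', 'w') as newfile:
  for line in oldfile:
    # line = 'correct, wrong1, wrong2->wrong3'
    line = line.strip()
    # Split on a comma (plus any following spaces) or on the arrow
    terms = re.split(r", *|->", line)
    # terms[0] is the correct word; everything after it is a misspelling
    newfile.write(",".join(terms[1:]) + "->" + terms[0] + '\n')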

Output new.txt:

enlist,Enlish->English
Halloween,Hallowean->Hallowe'en

Answer 1

Let's assume that every comma is a word separator. For convenience, I'll split each line on both commas and the arrow:

import re

line = 'correct, wrong1, wrong2->wrong3'
terms = re.split(r", *|->", line)
new_line = ",".join(terms[1:]) + "->" + terms[0]  # no space after commas, matching the required format
print(new_line)

You can put this back into the file-reading loop, right?

Answer 2

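I suggest building a list rather than assuming a fixed number of elements. When you split on commas, the first element is the correct word, elements [1:-1] are misspellings, and [-1] is the one you still have to split on the arrow.

I think you have also found that write needs a newline ("\n"), as suggested in the comments.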
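
The list-based approach described above could be sketched roughly as follows. The function name reformat_line is hypothetical (it does not appear in the question), and the sketch assumes every line contains at least one comma and exactly one arrow, as in the sample corpus.

```python
def reformat_line(line):
    """Turn 'correct, wrong1, wrong2->wrong3' into 'wrong1,wrong2,wrong3->correct'.

    Hypothetical helper: assumes at least one comma and exactly one '->' per line.
    """
    # Split on commas only; the last piece still contains the arrow.
    parts = [part.strip() for part in line.strip().split(",")]
    correct = parts[0]        # first element is the correct word
    wrongs = parts[1:-1]      # middle elements are plain misspellings
    # Split the last piece on the arrow to get the final two misspellings.
    wrongs.extend(piece.strip() for piece in parts[-1].split("->"))
    # write() does not add a newline, so append one here.
    return ",".join(wrongs) + "->" + correct + "\n"


if __name__ == "__main__":
    print(reformat_line("Hallowe'en, Halloween->Hallowean"), end="")
```

Inside the file-reading loop this would be used as newfile.write(reformat_line(line)). Compared with the regular-expression split, this version needs no import, at the cost of one extra split step for the last element.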

