Using Python to strip a phrase from a bunch of files

Question

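I have 29 .srt files. They all contain HTML markup such as <font color="#E5E5E5">, <font color="#CCCCCC">, and </font>. I want to remove all of this markup from all 29 files, but I don't know how to do it in one go. I've attached the code I'm using now. It can only change one file at a time, and it leaves me with 3 useless files. Can anyone help me with this?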

    import re

    string = open('/Users/Cynthia/Desktop/Jeunesse/Longivity English/Jeunesse Longevity TV - Episode 27 - Lifestyle - PART 4 - Healthy Nutrition 2 2.en.transcribed.txt').read()
    new_str = re.sub('<font color="#CCCCCC">', ' ', string)
    open('b.txt', 'w').write(new_str)

    string = open('/Users/Cynthia/Desktop/Jeunesse/Longivity English/b.txt').read()
    new_str = re.sub('<font color="#E5E5E5">', ' ', string)
    open('c.txt', 'w').write(new_str)

    string = open('/Users/Cynthia/Desktop/Jeunesse/Longivity English/c.txt').read()
    new_str = re.sub('</font>', ' ', string)
    open('d.txt', 'w').write(new_str)

Answer 1

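Here is a beginner-friendly way to process one file using a function. It chains your three steps together and writes the result to a new file.

So you only need to call strip_html once per file, passing the file name and the name of the new file.

This example has a list of file names, and it writes each cleaned file under the same name with '.fixed' appended.

Note that this is a simple approach, and I've left out a lot to keep it easy to follow. Once you know more about programming you will find better ways, but this should get you working.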

    import re

    def strip_html(filename, newfilename):
        # Read the whole file into memory.
        with open(filename) as f1:
            string = f1.read()
        # Replace each font tag with a space, one after the other.
        new_str = re.sub('<font color="#CCCCCC">', ' ', string)
        new_str = re.sub('<font color="#E5E5E5">', ' ', new_str)
        new_str = re.sub('</font>', ' ', new_str)
        # Write the cleaned text to the new file.
        with open(newfilename, 'w') as w1:
            w1.write(new_str)

    # The files to clean; add the remaining file names to this list.
    files = [
        '/Users/Cynthia/Desktop/Jeunesse/Longivity English/Jeunesse Longevity TV - Episode 27 - Lifestyle - PART 4 - Healthy Nutrition 2 2.en.transcribed.txt',
        '/Users/Cynthia/Desktop/Jeunesse/Longivity English/Jeunesse Longevity TV - Episode 28 - Lifestyle - PART 1 - Healthy Nutrition 3 2.en.transcribed.txt',
    ]

    for file in files:
        strip_html(file, file + '.fixed')

Hope that helps.
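If the files turn out to contain font tags with other colors as well, the three re.sub calls in strip_html could be collapsed into one pattern. The following is only a sketch of that idea, not part of the answer's code: the names FONT_TAG and strip_font_tags are made up for illustration, and the pattern assumes no tag has a literal '>' inside an attribute value. Like the original, it replaces each tag with a single space.

```python
import re

# Matches an opening <font ...> tag with any attributes, or a closing </font>.
# Assumption: attribute values never contain a literal '>'.
FONT_TAG = re.compile(r'</?font\b[^>]*>', re.IGNORECASE)

def strip_font_tags(text):
    """Return text with every font tag replaced by a single space."""
    return FONT_TAG.sub(' ', text)

sample = '<font color="#E5E5E5">hello</font><font color="#CCCCCC"> world</font>'
print(strip_font_tags(sample))
```

Inside strip_html, the three re.sub lines would then become a single call such as new_str = strip_font_tags(string).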

Once you have this running, take a look at os.listdir() to see how to get the list of file names from a directory instead of writing them out in the code.
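As a rough sketch of that suggestion, the loop over a hand-written list could be replaced by a loop over os.listdir(). The function name strip_directory and its extension parameter are hypothetical. The question mentions .srt files while the example paths end in .txt, so the extension is an assumption you would adjust to match your files.

```python
import os
import re

def strip_html(filename, newfilename):
    """Same steps as the answer's function: replace the font tags, write a new file."""
    with open(filename) as f1:
        text = f1.read()
    for tag in ('<font color="#CCCCCC">', '<font color="#E5E5E5">', '</font>'):
        text = re.sub(tag, ' ', text)
    with open(newfilename, 'w') as w1:
        w1.write(text)

def strip_directory(directory, extension='.srt'):
    """Clean every regular file in `directory` whose name ends with `extension`.

    Returns the list of new file paths that were written.
    """
    written = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # Skip subdirectories and files with another extension,
        # which also skips '.fixed' output from an earlier run.
        if not os.path.isfile(path) or not name.endswith(extension):
            continue
        strip_html(path, path + '.fixed')
        written.append(path + '.fixed')
    return written

# Hypothetical usage; replace the path with the folder that holds your files:
# strip_directory('/path/to/your/subtitles')
```

Because the output names end in '.fixed', running it a second time does not reprocess files it already produced.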
