Remove words from a subtitle file that aren't in a wordlist (of common words)

Question

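I have some subtitle files. I don't intend to learn every word in these subtitles; there is no need to learn difficult terms such as cleidocranial, dysplasia...

I found this script here: Remove words from a cell that aren't in a list. But I don't know how to modify it or run it. (I'm using Linux.)

Here is our example:

Subtitle file (.srt):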

2
00:00:13,000 --> 00:00:15,000
People with cleidocranial dysplasia are good.

Word list of 3000 common words (.txt):

...
people
with
are
good
...

The output we need (.srt):

2
00:00:13,000 --> 00:00:15,000
People with * * are good.

Or, if possible, just mark them (.srt):

2
00:00:13,000 --> 00:00:15,000
People with cleidocranial* dysplasia* are good.

If there is a solution that only works on plain text (without timecodes), that's fine; just explain how to run it.
Thanks.

Answer 1

The following processes only the 3rd line of each '.srt' file. It can easily be adapted to process other lines and/or other files.

import os
import re
from glob import glob

with open('words.txt') as f:
    keep_words = {line.strip().lower() for line in f}

for filename_in in glob('*.srt'):
    filename_out = f'{os.path.splitext(filename_in)[0]}_new.srt'
    with open(filename_in) as fin, open(filename_out, 'w') as fout:
        for i, line in enumerate(fin):
            if i == 2:
                parts = re.split(r"([\w']+)", line.strip())
                parts[1::2] = [w if w.lower() in keep_words else '*' for w in parts[1::2]]
                line = ''.join(parts) + '\n'
            fout.write(line)

Result (for the subtitle.srt example you gave):

! cat subtitle_new.srt
2
00:00:13,000 --> 00:00:15,000
People with * * are good.

Alternative: just add a '*' next to each out-of-vocabulary word:

# replace:
#                 parts[1::2] = [w if w.lower() in keep_words else '*' for w in parts[1::2]]
                parts[1::2] = [w if w.lower() in keep_words else f'{w}*' for w in parts[1::2]]

The output is then:

2
00:00:13,000 --> 00:00:15,000
People with cleidocranial* dysplasia* are good.

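Explanation:

The first open reads in all the wanted words, makes sure they are lowercase, and puts them into a set (for fast membership testing).
We use glob to find all filenames ending in '.srt'.
For each such file, we construct a new filename derived from it: '..._new.srt'.
We read in all lines, but modify only line i == 2 (i.e. the 3rd line, since enumerate starts at 0 by default).
line.strip() removes the trailing newline (along with any other leading or trailing whitespace).
We could use line.strip().split() to split the line into words, but it would leave 'good.' as the last word; not good. The regex used here is a common one for splitting text into words (in particular, it keeps single quotes inside words, as in "don't"; that may or may not be what you want, so feel free to adapt it).
We split on the capturing group r"([\w']+)" instead of splitting on non-word characters, so that parts contains both the words and whatever separates them. For example, 'People, who are good.' becomes ['', 'People', ', ', 'who', ' ', 'are', ' ', 'good', '.'].
The words themselves are every other element of parts, starting at index 1.
If the lowercase form of a word is not in keep_words, we replace it with '*'.
Finally, we reassemble that line and, in general, write all lines to the new file.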
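As a rough sketch of how the loop above might be adapted to every text line of a file rather than only the 3rd line: the helper names below (mask_line, is_text_line, mask_srt_lines) are made up for illustration, and the rule used to recognise cue numbers and timecodes is an assumption about typical .srt files, not something stated in the answer.

```python
import re

WORD_RE = re.compile(r"([\w']+)")  # same word pattern as in the answer above


def mask_line(line, keep_words, mark_only=False):
    """Replace (or, with mark_only=True, tag) each word not in keep_words."""
    parts = WORD_RE.split(line)
    # Odd indices hold the words; even indices hold the separators.
    parts[1::2] = [
        w if w.lower() in keep_words else (f"{w}*" if mark_only else "*")
        for w in parts[1::2]
    ]
    return "".join(parts)


def is_text_line(line):
    """Heuristic (assumption): blank lines, all-digit cue numbers and
    lines containing '-->' are not subtitle text."""
    s = line.strip()
    return bool(s) and not s.isdigit() and "-->" not in s


def mask_srt_lines(lines, keep_words, mark_only=False):
    """Yield every line of an .srt file, masking only the text lines."""
    for line in lines:
        if is_text_line(line):
            yield mask_line(line, keep_words, mark_only)
        else:
            yield line
```

With these helpers, the inner for loop of the script could be replaced by fout.writelines(mask_srt_lines(fin, keep_words)). Note that a subtitle whose text is only digits would be mistaken for a cue number by this heuristic. To run the script on Linux, one option is to save it under a name of your choice (say mask_srt.py, a hypothetical name) in the directory that holds words.txt and the .srt files, and start it with python3 mask_srt.py.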

Answer 2

You can simply run a Python script like this:

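import re

with open("words.txt", "rt") as words:
    # build a set of lowercase words (fast, case-insensitive lookups)
    word_set = {w.strip().lower() for w in words if w.strip()}

def mark(match):
    # wrap the word in asterisks unless it is in the word list
    word = match.group()
    return word if word.lower() in word_set else f"*{word}*"

with open("subtitle.srt", "rt") as subtitles, open("subtitle_output.srt", "wt") as out:
    for line in subtitles:
        if line[:1].isdigit():
            # cue numbers and timecodes start with a digit: copy them unchanged
            out.write(line)
            continue
        # mark the unknown words and write each line exactly once
        out.write(re.sub(r"[\w']+", mark, line))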

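This script replaces every word that is not in the common-words file with a marked *word*, leaves the original file untouched, and writes everything to a new output file.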
