How to strip a certain piece of text from each line of a text file?

Question

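I have downloaded the tab-separated Tatoeba dataset of English-German sentence pairs to train an NMT model on it. Unfortunately, every line ends with various additional information: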

Go. Geh.    CC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #8597805 (Roujin)
Hi. Hallo!  CC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #380701 (cburgmer)

How can I remove everything after the second sentence on each line of the text file?

I tried doing this in Python:

for line in text:
  split = line.split('CC-BY', 1)
  line = split[0]

...but that did not work. What I am looking for is a file that looks like this:

Go. Geh.
Hi. Hallo!

Any help would be much appreciated :)

Answer 1

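The idea of using split is right, but assigning to the loop variable inside the for loop does not change the list elements; it only rebinds the name line.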
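To illustrate the point about the loop variable, here is a minimal sketch. The two sample lines are copied from the question, with tabs assumed between the columns, and the names unchanged and updated are made up for this demonstration.

```python
# Sample lines from the question (tabs between columns are an assumption).
text = [
    "Go.\tGeh.\tCC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #8597805 (Roujin)",
    "Hi.\tHallo!\tCC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #380701 (cburgmer)",
]

# Rebinding the loop variable only changes the local name, not the list.
unchanged = list(text)
for line in unchanged:
    line = line.split("CC-BY", 1)[0]

# Assigning through the index does update the list element.
updated = list(text)
for i, line in enumerate(updated):
    updated[i] = line.split("CC-BY", 1)[0].strip()

print(unchanged == text)  # the first loop left every element as it was
print(updated)
```

The second loop works, but it is more verbose than building a new list in one expression.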

You should also avoid using split as a variable name, since it is already the name of a built-in string method.

A list comprehension will do the job:

new_lines = [line.split('CC-BY', 1)[0].strip() for line in text]

strip is added because you probably want to remove the extra whitespace at the end of each line.

With your input text saved as text.txt, the following code:

with open("text.txt", encoding="utf8") as f:
    text = f.read().splitlines()

new_lines = [line.split('CC-BY', 1)[0].strip() for line in text]

for line in new_lines:
    print(line)

gives the output:

Go. Geh.
Hi. Hallo!
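Since the question asks for a file rather than printed output, the cleaned lines can also be written back out. This is only a sketch: new_lines is hard-coded here to stand in for the list built above, and the output name text.stripped.txt in a temporary directory is a made-up placeholder.

```python
import os
import tempfile

# Stand-in for the cleaned lines produced by the list comprehension.
new_lines = ["Go.\tGeh.", "Hi.\tHallo!"]

# Hypothetical output path; a temporary directory keeps the sketch self-contained.
out_path = os.path.join(tempfile.mkdtemp(), "text.stripped.txt")

# Write one cleaned line per row, restoring the newline removed by splitlines().
with open(out_path, "w", encoding="utf8") as f:
    for line in new_lines:
        f.write(line + "\n")

# Read the file back to check what was written.
with open(out_path, encoding="utf8") as f:
    written = f.read()

print(written)
```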

Answer 2

I love Python, but I would not do this in Python. You can use cut from bash to pick out the first two columns:

cut -f1,2 tatoeba.en.de.tsv

(Assuming the file is named tatoeba.en.de.tsv.)

Redirect the output into a file:

cut -f1,2 tatoeba.en.de.tsv > tatoeba.en.de.stripped.tsv

Advantages over the naive Python approach:

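cut is easier to use, more readable, and takes less code.
cut does not load the whole file into memory, so it can handle very large files.
> redirects only standard output to the file, not error messages.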

The fact that tabs are first-class citizens in the core bash utilities is an important reason for choosing TSV for machine translation data.

If you really want to do this in Python, in a way that works for any Tatoeba file content or size:

Split on tabs and use slice or slice notation; do not split on a value like CC-BY, and do not strip.
Read from an iterator; do not read all lines into one object.

import sys

filename = sys.argv[1]  # Pass the name of the file
with open(filename, 'r', encoding='utf8') as f:
    for line in f:
        # "Slice" the first 2 columns; drop the newline in case there are only 2
        source, target = line.rstrip('\n').split('\t')[:2]
        print(source, target, sep='\t')
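The same streaming idea can be wrapped in a generator so it is easy to test without a file. This is a rough sketch, not part of the original answer: the function name first_two_columns and the choice to silently skip lines with fewer than two columns are assumptions.

```python
from typing import Iterable, Iterator


def first_two_columns(lines: Iterable[str]) -> Iterator[str]:
    """Yield 'source<TAB>target' for each line with at least two columns."""
    for line in lines:
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 2:
            continue  # assumption: skip malformed lines instead of failing
        yield "\t".join(columns[:2])


# In-memory sample standing in for an open file; the third line is malformed.
sample = [
    "Go.\tGeh.\tCC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #8597805 (Roujin)\n",
    "Hi.\tHallo!\tCC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #380701 (cburgmer)\n",
    "a line without any tab\n",
]

result = list(first_two_columns(sample))
print(result)
```

Because an open file object is itself an iterator over lines, it can be passed to first_two_columns directly, and only one line is held in memory at a time.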


python

text

strip

machine-translation