Editing data encapsulated in tags from a text file

Question

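I am currently cleaning data in text files. The files contain transcripts of everyday conversations. Some files are multilingual, and the multilingual parts look like this: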

around that area,<tamil>அம்மா:ammaa</tamil> would have cooked too

so at least need to <mandarin>跑两趟:pao liang tang</mandarin>,then I told them that it is fine

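A single file can contain several of these other languages.

Going back to the first example, what I am trying to do is remove "<tamil>", "அம்மா:" and "</tamil>", keeping only the romanized (English) pronunciation of the word. I tried replacing <tamil> with "", but I am not sure how to remove the Tamil word itself.

The expected output is: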

around that area, ammaa would have cooked too

so at least need to pao liang tang,then I told them that it is fine

How can I do this?

Answer 1

Yes, please try this:

content = "around that area,<tamil>அம்மா:ammaa</tamil> would have cooked too"

def is_markup(token):
    # True for tag tokens (<...) and HTML entities (&...;)
    token = token.strip()
    return token.startswith('<') or (token.startswith('&') and token.endswith(';'))

# Pad the angle brackets with spaces so that tags become separate tokens
padded = content.replace('<', ' <').replace('>', '> ')
segments = [s.strip() for s in padded.split('>') if not is_markup(s)]
ft = ' '.join(w for s in segments for w in s.split() if not is_markup(w))
# Dropping non-ASCII characters removes the Tamil script but leaves the ':'
outputs = ft.encode('ascii', 'ignore')
print(outputs.decode('utf-8'))

Output:

around that area, :ammaa would have cooked too

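This is not the complete output. As you can see, the resulting string still contains some leftovers, such as the ":" and possibly other punctuation, so please use a regular expression to clean those up yourself. I have posted 99% of the answer.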
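For the remaining 1%, one possible regular-expression approach is sketched below. It is not part of the answer above. It assumes every tagged span has the form <lang>native script:romanization</lang>, with a single ":" before the romanization and no nested tags. The names TAGGED_SPAN and keep_romanization are hypothetical, chosen only for this illustration.

```python
import re

# Assumption: spans look like <lang>native script:romanization</lang>,
# with no nested tags. Group 1 is the tag name, group 2 the romanization.
TAGGED_SPAN = re.compile(r"<(\w+)>[^<:]*:([^<]*)</\1>")

def _romanized(match):
    text = match.string
    # Add a space on the left if the span is glued to the preceding text.
    left = " " if match.start() > 0 and not text[match.start() - 1].isspace() else ""
    # Add a space on the right only if a letter or digit follows directly.
    right = " " if match.end() < len(text) and text[match.end()].isalnum() else ""
    return left + match.group(2).strip() + right

def keep_romanization(text):
    """Replace each tagged span with its romanized part (hypothetical helper)."""
    return TAGGED_SPAN.sub(_romanized, text)

print(keep_romanization(
    "around that area,<tamil>அம்மா:ammaa</tamil> would have cooked too"))
# around that area, ammaa would have cooked too
print(keep_romanization(
    "so at least need to <mandarin>跑两趟:pao liang tang</mandarin>,then I told them that it is fine"))
# so at least need to pao liang tang,then I told them that it is fine
```

Because the pattern matches any tag name through the backreference \1, the same call handles <tamil>, <mandarin> and any other language tag in a file. Spans that do not follow the assumed native:romanization layout are left untouched, so they are easy to spot afterwards.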

python

data-cleaning