Editing data enclosed in tags in a text file
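I am currently cleaning data in text files. The files contain transcripts of everyday conversational speech. Some files are multilingual, and the multilingual parts look like this: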
around that area,<tamil>அம்மா:ammaa</tamil> would have cooked too
so at least need to <mandarin>跑两趟:pao liang tang</mandarin>,then I told them that it is fine
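A single file can contain several such other languages.
Going back to the first example, what I am trying to do is remove "<tamil>", "அம்மா:" and "</tamil>", keeping only the romanized pronunciation of the word. I tried replacing <tamil> with "", but I am not sure how to remove the Tamil word itself.
The expected output is: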
around that area, ammaa would have cooked too
so at least need to pao liang tang,then I told them that it is fine
How can I do this?
Yes, please try this:
content="around that area,<tamil>அம்மா:ammaa</tamil> would have cooked too"
ft=' '.join([word for line in [item.strip() for item in content.replace('<',' <').replace('>','> ').split('>') if not (item.strip().startswith('<') or (item.strip().startswith('&') and item.strip().endswith(';')))] for word in line.split() if not (word.strip().startswith('<') or (word.strip().startswith('&') and word.strip().endswith(';')))])
outputs=ft.encode('ascii','ignore')
print(outputs.decode('utf-8'))
Output:
around that area, :ammaa would have cooked too
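This is not the complete output. As you can see, the resulting string still contains some extra characters, such as the ":" and some punctuation, so clean those up yourself with a regular expression. I have posted 99% of the answer.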
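For the remaining cleanup, one possible approach is a single regular expression that matches the whole tagged span and keeps only the part after the colon. This is a minimal sketch, not a tested solution for every file: it assumes each span has the form <lang>native:romanized</lang> with exactly one colon separating the two parts, and the names TAGGED and keep_romanized are made up for illustration.

```python
import re

# Assumed span shape: <lang>native:romanized</lang>.
# \1 requires the closing tag to repeat the opening tag name.
TAGGED = re.compile(r"<(\w+)>[^<:]*:([^<]*)</\1>")

def keep_romanized(text: str) -> str:
    """Replace each tagged span with its romanized part only."""
    def repl(m: re.Match) -> str:
        roman = m.group(2).strip()
        # Add a space when the tag directly follows a non-space
        # character (e.g. "area,<tamil>"), as in the expected output.
        prev = m.string[m.start() - 1] if m.start() > 0 else " "
        return roman if prev.isspace() else " " + roman
    return TAGGED.sub(repl, text)

lines = [
    "around that area,<tamil>அம்மா:ammaa</tamil> would have cooked too",
    "so at least need to <mandarin>跑两趟:pao liang tang</mandarin>,then I told them that it is fine",
]
for line in lines:
    print(keep_romanized(line))
```

Because the pattern does not name a specific language, it should handle any tag that follows the same shape. Spans without a colon are left untouched, so they would need separate handling.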