Reading a Txt File into Python 3.5 with Korean and English characters in it

Question

I am trying to read a txt file that contains both Korean and English.
Here is a sample:
52:15 Hello. 안녕하십니까.

My code is:

import re

# Read each line and split it into tokens
f = open(infile, 'r')
for line in f:
    # Expected layout: timecode, tab, English subtitle, tab, foreign subtitle
    matchObj = re.match(r"(\d\d:\d\d)\t([^\t]+)\t(.*)$", line)
    if matchObj:
        startTC, englishSubtitle, foreignSubtitle = matchObj.groups()
    else:
        raise SyntaxError(line)

When I read it into Python (3.5) on a 2012 MacBook Pro running El Capitan, I get the error message shown below.

Error message:

python3 *.py
Traceback (most recent call last):
  File "txtToSrt.py", line 48, in <module>
    readFileData( "Korean.txt" )
  File "txtToSrt.py", line 26, in readFileData
    for line in f:
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfe in position 0: invalid start byte

Could you please suggest how to read this file?

Answer 1

I added the following line at the top:

import codecs

and changed the line that opens the file as follows:

f = open(infile, 'r', encoding="utf-16")
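For context, the byte 0xfe at position 0 in the traceback is consistent with a UTF-16 byte order mark (FE FF or FF FE), which would explain why the utf-16 codec succeeds where the default utf-8 fails. The following is only a minimal sketch of that behaviour: the helper read_lines and the file name demo_utf16.txt are hypothetical, and the sample file is created on the fly because the original Korean.txt is not available.

```python
import os
import tempfile


def read_lines(path, encoding="utf-16"):
    """Return the lines of a text file decoded with the given encoding."""
    with open(path, "r", encoding=encoding) as f:
        return [line.rstrip("\n") for line in f]


# Create a small UTF-16 sample file (hypothetical name) for the demo.
sample = "52:15\tHello.\t안녕하십니까.\n"
demo_path = os.path.join(tempfile.mkdtemp(), "demo_utf16.txt")
with open(demo_path, "w", encoding="utf-16") as f:
    f.write(sample)

# The raw bytes begin with a byte order mark, FF FE or FE FF.
with open(demo_path, "rb") as f:
    bom = f.read(2)

# Decoding the same file as UTF-8 fails on the very first byte.
try:
    read_lines(demo_path, encoding="utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

lines = read_lines(demo_path)
print(bom, utf8_failed, lines)
```

The utf-16 codec reads the byte order mark to pick the byte order and then drops it, so the decoded lines start directly with the timecode.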

The data can now be read, but I cannot write it to a file. The code that writes is:

outfile = open("out.txt", 'w')
outfile.write( "{0}\n{1}\n".format(startTC, foreignSubtitle.encode("utf-16")) )

The output I get is:

01:00:01:16
b'\xff\xfe\x14\xbc\x98\xb0\x90\xc7'

I would like the second line of the output to appear in Korean. How can I do that? Thanks.
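One likely cause of the b'...' output: str.format() inserts the repr of a bytes object, so calling .encode() before formatting writes the escaped bytes literally. A possible fix, sketched below under assumptions, is to pass the str unchanged and let the file object do the encoding. The helper write_subtitle is hypothetical, and utf-8 is an assumed choice for the output file; utf-16 would work the same way.

```python
import os
import tempfile


def write_subtitle(path, start_tc, subtitle, encoding="utf-8"):
    """Write a timecode and a subtitle as text; the file object encodes."""
    with open(path, "w", encoding=encoding) as out:
        # Pass str values directly; do not call .encode() before formatting.
        out.write("{0}\n{1}\n".format(start_tc, subtitle))


# Write to a temporary location so the demo does not touch real files.
out_path = os.path.join(tempfile.mkdtemp(), "out.txt")
write_subtitle(out_path, "01:00:01:16", "안녕하십니까.")

# Read the file back with the same encoding to check the content.
with open(out_path, "r", encoding="utf-8") as f:
    written = f.read()
print(written)
```

Whether the Korean then displays correctly also depends on the viewer or terminal opening out.txt with the same encoding.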

Answer 2

Unfortunately, Python seems to have trouble with the won sign. Try the following in Python 3.5 to check:

# Encode each string to UTF-8 bytes and print the bytes
a_bytes = 'à'.encode('utf-8')
print(a_bytes)

b_bytes = '₩'.encode('utf-8')
print(b_bytes)

# Decode the bytes back to text and print the strings
a_string = a_bytes.decode('utf-8')
print(a_string)

b_string = b_bytes.decode('utf-8')
print(b_string)
