Converting tab delimited text file of unknown encoding to R-compatible file encoding in Python

Question

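I have a lot of text files of unknown encoding that I can't open in R at all, but I would like to work with them there. With the help of codecs and UTF-16, I was finally able to open them in Python: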

import codecs

f = codecs.open(input,"rb","utf-16")
for line in f:
    print repr(line)

A line from one of my files now looks like this when printed in Python:

u'06/28/2016\t14:00:00\t0,000\t\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\t00000000\t6,000000\t0,000000\t144,600000\t12,050000
\t8,660000\t-120,100000\t-0,040000\t-0,110000\t1,340000\t5,360000
\t-1,140000\t-1,140000\t24,523000\t269,300000\t271,800000\t0,130000
\t272,000000\t177,000000\t0,765000\t0,539000\t\r\n'

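The leading "u" tells me this is a unicode string, but now I don't really know what to do with it. My goal is to convert the text files into something I can use in R, e.g. a properly encoded csv, but my attempt with unicodecsv failed: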

in_txt = unicodecsv.reader(f, delimiter = '\t', encoding = 'utf-8')
out_csv = unicodecsv.writer(open(output), 'wb', encoding = 'utf-8')

out_csv.writerows(in_txt)

Can anyone tell me what the main mistake in my approach is?
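
One way the intended conversion could look is sketched below. It assumes the files really are UTF-16 (as the successful codecs.open call suggests) and that the run of \x00 characters in the sample row is padding that can be dropped, which is a guess about the data. The function name and its parameters are made up for illustration. Since the file object already yields decoded text, the sketch re-encodes each line directly to UTF-8 instead of passing it through a csv reader that expects bytes.

```python
import io


def convert_utf16_tsv_to_utf8(src_path, dst_path, src_encoding="utf-16"):
    """Re-encode a tab-delimited text file as UTF-8, dropping NUL characters."""
    # io.open decodes on read and encodes on write (Python 2.7 and 3 alike).
    # newline="" leaves the original \r\n line endings untouched.
    with io.open(src_path, "r", encoding=src_encoding, newline="") as src, \
         io.open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            # Assumption: the \x00 run is padding that is safe to remove.
            dst.write(line.replace(u"\x00", u""))
```

The output stays tab-delimited and keeps the decimal commas, so on the R side something like read.delim(path, fileEncoding = "UTF-8", dec = ",") would be a natural thing to try.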

Answer 1

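You can try guess_encoding(y) from the readr package in R. It isn't 100% bulletproof, but it has worked for me in the past and should at least point you in the right direction: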

guess_encoding(y)
#>     encoding confidence
#> 1 ISO-8859-2        0.4
#> 2 ISO-8859-1        0.3

Try reading your file in with read_tsv(), and then try guess_encoding().

Hope this helps.
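
For the Python side of the question, a rough counterpart to guessing the encoding is to look for a byte-order mark at the start of the file. The sketch below only uses the BOM constants from the standard codecs module. The helper name sniff_bom is made up for illustration, and a file without a BOM simply yields None, so this is a first check rather than a real detector.

```python
import codecs

# Byte-order marks paired with a codec name that can decode the file.
# UTF-32 is checked before UTF-16 because BOM_UTF32_LE begins with BOM_UTF16_LE.
# (A UTF-16 LE file whose first character is NUL would be misread as UTF-32.)
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(path):
    """Return a codec name if the file starts with a known BOM, else None."""
    with open(path, "rb") as f:
        head = f.read(4)  # the longest BOM is 4 bytes
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None
```

If this returns "utf-16" for the files in question, that would be consistent with codecs.open(..., "utf-16") having worked.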

Tags: encoding, r, file-conversion, python-2.7