Fastest way to convert CSV files from UTF-16 tabs, to UTF-8-SIG commas
I have two 30 GB CSV files, each containing tens of millions of records.
They use tabs as the delimiter and were saved as UTF-16 (thanks, Tableau :-( ).
I would like to convert these files to utf-8-sig, with commas instead of tabs.
I tried the code below (some of the variables are declared earlier):
csv_df = pd.read_csv(p, encoding=encoding, dtype=str, low_memory=False, error_bad_lines=True, sep='\t')
csv_df.to_csv(os.path.join(output_dir, os.path.basename(p)), encoding='utf-8-sig', index=False)
I also tried:
Both run very slowly and practically never finish.
Is there a better way to do the conversion? Maybe Python isn't the best tool for this?
Ideally I would like to get the data into a database, but I'm afraid that isn't a realistic option at the moment.
Thanks!
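
For what it's worth, the pandas approach can at least avoid loading all 30 GB into memory by streaming the file in chunks. A minimal sketch, assuming the same p, output_dir, and encoding variables as above (the chunk size of 1,000,000 rows is an arbitrary value to tune against available memory):

import os
import pandas as pd

out_path = os.path.join(output_dir, os.path.basename(p))
# open the handle once in 'w' mode so the utf-8-sig BOM is written a single time
with open(out_path, 'w', encoding='utf-8-sig', newline='') as out:
    for i, chunk in enumerate(pd.read_csv(p, encoding=encoding, dtype=str, sep='\t', chunksize=1000000)):
        # emit the header only with the first chunk
        chunk.to_csv(out, index=False, header=(i == 0))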
I converted a 35 GB file in two minutes. Note that you can tune performance by changing the constants in the top lines.
BUFFER_SIZE_READ = 1000000  # characters read per buffer; depends on available memory
MAX_LINE_WRITE = 1000  # number of lines to write at once
source_file_name = 'source_file_name'
dest_file_name = 'destination_file_name'
source_encoding = 'file_source_encoding'  # e.g. 'utf_16'
destination_encoding = 'file_destination_encoding'  # e.g. 'utf_8'
BOM = True  # True for utf_8_sig
lines_count = 0

def read_huge_file(file_name, encoding='utf_8', buffer_size=1000):
    def read_buffer(file_obj, size=1000):
        while True:
            data = file_obj.read(size)
            if data:
                yield data
            else:
                break

    with open(file_name, encoding=encoding) as source_file:
        buffer_in = ''
        for buffer in read_buffer(source_file, size=buffer_size):
            buffer_in += buffer
            lines = buffer_in.splitlines()
            # the last element may be an incomplete line; carry it into the next buffer
            buffer_in = lines.pop()
            if lines:
                yield lines
        # flush whatever remains after the final buffer
        if buffer_in:
            yield [buffer_in]

def process_data(data):
    def write_lines(lines_to_write):
        # the destination is reopened per batch, so the BOM is written by hand
        # instead of using the utf_8_sig codec (which would emit a BOM on every append)
        with open(dest_file_name, 'a', encoding=destination_encoding) as dest_file:
            if BOM and dest_file.tell() == 0:
                dest_file.write(u'\ufeff')
            dest_file.write(lines_to_write)
        return ''

    global lines_count
    lines = ''
    for line in data:
        lines_count += 1
        lines += line + '\n'
        if not lines_count % MAX_LINE_WRITE:
            lines = write_lines(lines)
    if lines:
        write_lines(lines)

for buffer_data in read_huge_file(source_file_name, encoding=source_encoding, buffer_size=BUFFER_SIZE_READ):
    process_data(buffer_data)
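
One caveat: the script above only re-encodes; it never turns tabs into commas. A bare line.replace('\t', ',') would also corrupt any quoted field that happens to contain a tab, so a safer route (a sketch, not part of the original answer; convert_delimiters is a hypothetical helper name) is to stream the rows through the csv module, which understands quoting:

import csv

def convert_delimiters(in_path, out_path, src_enc='utf_16', dst_enc='utf_8_sig'):
    # csv.reader is lazy and writer.writerows consumes it row by row,
    # so memory use stays flat even on 30 GB inputs
    with open(in_path, newline='', encoding=src_enc) as src, \
         open(out_path, 'w', newline='', encoding=dst_enc) as dst:
        reader = csv.reader(src, delimiter='\t')
        writer = csv.writer(dst)  # default delimiter is ','
        writer.writerows(reader)

Outside Python, iconv -f UTF-16 -t UTF-8 handles the re-encoding step very quickly, but the tab-to-comma step still needs something CSV-aware.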