Fastest way to convert CSV files from UTF-16 tabs, to UTF-8-SIG commas
I have two 30 GB CSV files, each containing tens of millions of records.
They use tabs as the delimiter and were saved as UTF-16 (thanks, Tableau :-( ).
I would like to convert these files to utf-8-sig, with commas instead of tabs.
I tried the code below (some of the variables are declared earlier):
csv_df = pd.read_csv(p, encoding=encoding, dtype=str, low_memory=False, error_bad_lines=True, sep='\t')
csv_df.to_csv(os.path.join(output_dir, os.path.basename(p)), encoding='utf-8-sig', index=False)
I also tried:
Both run very slowly and practically never finish.
Is there a better way to do the conversion? Maybe Python isn't the best tool for this?
Ideally I would like to get the data into a database, but I'm afraid that isn't a realistic option at the moment.
Thanks!
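
For what it's worth, the pandas approach can at least avoid loading all 30 GB into memory by streaming the file in chunks. A minimal sketch, assuming the same p, output_dir, and encoding variables as above (the chunk size of 1,000,000 rows is an arbitrary value to tune against available memory):

import os
import pandas as pd

out_path = os.path.join(output_dir, os.path.basename(p))
# open the handle once in 'w' mode so the utf-8-sig BOM is written a single time
with open(out_path, 'w', encoding='utf-8-sig', newline='') as out:
    for i, chunk in enumerate(pd.read_csv(p, encoding=encoding, dtype=str, sep='\t', chunksize=1000000)):
        # emit the header only with the first chunk
        chunk.to_csv(out, index=False, header=(i == 0))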
I converted a 35 GB file in two minutes. Note that you can tune performance by changing the constants in the top lines.
BUFFER_SIZE_READ = 1000000  # characters read per buffer; depends on available memory
MAX_LINE_WRITE = 1000  # number of lines to write at once
source_file_name = 'source_file_name'
dest_file_name = 'destination_file_name'
source_encoding = 'file_source_encoding'  # e.g. 'utf_16'
destination_encoding = 'file_destination_encoding'  # e.g. 'utf_8'
BOM = True  # True for utf_8_sig
lines_count = 0

def read_huge_file(file_name, encoding='utf_8', buffer_size=1000):
    def read_buffer(file_obj, size=1000):
        while True:
            data = file_obj.read(size)
            if data:
                yield data
            else:
                break

    with open(file_name, encoding=encoding) as source_file:
        buffer_in = ''
        for buffer in read_buffer(source_file, size=buffer_size):
            buffer_in += buffer
            lines = buffer_in.splitlines()
            # the last element may be an incomplete line; carry it into the next buffer
            buffer_in = lines.pop()
            if lines:
                yield lines
        # flush whatever remains after the final buffer
        if buffer_in:
            yield [buffer_in]

def process_data(data):
    def write_lines(lines_to_write):
        # the destination is reopened per batch, so the BOM is written by hand
        # instead of using the utf_8_sig codec (which would emit a BOM on every append)
        with open(dest_file_name, 'a', encoding=destination_encoding) as dest_file:
            if BOM and dest_file.tell() == 0:
                dest_file.write(u'\ufeff')
            dest_file.write(lines_to_write)
        return ''

    global lines_count
    lines = ''
    for line in data:
        lines_count += 1
        lines += line + '\n'
        if not lines_count % MAX_LINE_WRITE:
            lines = write_lines(lines)
    if lines:
        write_lines(lines)

for buffer_data in read_huge_file(source_file_name, encoding=source_encoding, buffer_size=BUFFER_SIZE_READ):
    process_data(buffer_data)
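
One caveat: the script above only re-encodes; it never turns tabs into commas. A bare line.replace('\t', ',') would also corrupt any quoted field that happens to contain a tab, so a safer route (a sketch, not part of the original answer; convert_delimiters is a hypothetical helper name) is to stream the rows through the csv module, which understands quoting:

import csv

def convert_delimiters(in_path, out_path, src_enc='utf_16', dst_enc='utf_8_sig'):
    # csv.reader is lazy and writer.writerows consumes it row by row,
    # so memory use stays flat even on 30 GB inputs
    with open(in_path, newline='', encoding=src_enc) as src, \
         open(out_path, 'w', newline='', encoding=dst_enc) as dst:
        reader = csv.reader(src, delimiter='\t')
        writer = csv.writer(dst)  # default delimiter is ','
        writer.writerows(reader)

Outside Python, iconv -f UTF-16 -t UTF-8 handles the re-encoding step very quickly, but the tab-to-comma step still needs something CSV-aware.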