How to add name of csv files as values in a column while merging 1000+ files?
I am trying to merge 1000+ csv files using the following code:
import glob
import shutil

path = r'path_to_files/'
all_files = glob.glob(path + "/*.csv")

with open('updated_thirteen_jan.csv', 'wb') as wfd:
    for f in all_files:
        with open(f, 'rb') as fd:
            shutil.copyfileobj(fd, wfd)
I am using the code above to avoid RAM crash issues, and it works fine. However, I also want it to do for me what the following code does:
import os
import glob
import pandas as pd

path = r'path_to_files/'
all_files = glob.glob(path + "/*.csv")

fields = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8']

li = []
first_one = True
for filename in all_files:
    if not first_one:  # if it is not the first csv file then skip the header row (row 0) of that file
        skip_row = [0]
    else:
        skip_row = []
        first_one = False

    # note: 'lang' has to be among the columns read (usecols) for the filter below to work
    df = pd.read_csv(filename, index_col=None, skiprows=skip_row, engine='python', usecols=fields)
    df = df[df['lang'] == 'en']
    df['file_name'] = os.path.basename(filename)
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)
From this code I want to keep the column selection (fields), the row skipping (skip_row), and the file_name column added as a value.
Any guidance?
If memory is the constraint, then one pandas-based solution is to iterate over chunks of rows:
import os
import pandas as pd

print(pd.__version__)
# works with this version: '1.3.4'

# gen sample files
all_files = [f"{_}.csv" for _ in range(3)]
for filename in all_files:
    df = pd.DataFrame(range(3))
    df.to_csv(filename, index=False)

# combine into one
mode = "w"
header = True
for filename in all_files:
    with pd.read_csv(
        filename,
        engine="python",
        iterator=True,
        chunksize=10_000,
    ) as reader:
        for df in reader:
            df["file_name"] = os.path.basename(filename)
            df.to_csv("some_file.csv", index=False, mode=mode, header=header)
            mode = "a"
            header = False
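The same chunked loop can also carry the column selection and row filter from the question. A minimal sketch below; the fields list and the 'lang' column are assumptions taken from the question's code, and 'lang' has to be included in usecols or the filter will raise a KeyError:

import os
import glob
import pandas as pd

path = r'path_to_files/'
all_files = glob.glob(path + "/*.csv")

# columns to keep; 'lang' is included so the filter below can use it
fields = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7', 'lang']

mode, header = "w", True
for filename in all_files:
    with pd.read_csv(filename, usecols=fields, iterator=True, chunksize=10_000) as reader:
        for df in reader:
            df = df[df['lang'] == 'en']                   # keep only English rows, as in the question
            df['file_name'] = os.path.basename(filename)  # tag every row with its source file
            df.to_csv('updated_thirteen_jan.csv', index=False, mode=mode, header=header)
            mode, header = "a", False                     # append (without header) from now on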
Another solution is to use dask:
# pip install dask
import dask.dataframe as dd
# dd.read_csv is mostly compatible with pd.read_csv options
# so can specify reading specific columns, etc.
ddf = dd.read_csv("some_path/*.csv")
ddf.to_csv('merged_file.csv', index=False, single_file=True)
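dask can also tag each row with its source file: dd.read_csv accepts an include_path_column argument that adds a column holding the path of the file each row came from (pass a string to choose the column name). A sketch, assuming the same 'lang' column as in the question:

import dask.dataframe as dd

# include_path_column adds a column (here named 'file_name') with the
# path of the source file for every row
ddf = dd.read_csv("some_path/*.csv", include_path_column="file_name")

# optional: the same row filter as in the question's pandas code
ddf = ddf[ddf["lang"] == "en"]

ddf.to_csv("merged_file.csv", index=False, single_file=True)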
The good old csv module can process one row at a time, so memory will not be an issue. The following code concatenates the csv files, keeping only the first header, and adds a filename column populated with the file name.
import glob
import csv

path = r'path_to_files/'
all_files = glob.glob(path + "/*.csv")

with open('updated_thirteen_jan.csv', 'w', newline='') as wfd:
    wr = csv.writer(wfd)
    first = True
    for f in all_files:
        with open(f, newline='') as fd:
            rd = csv.reader(fd)
            # skip header line, except for the first file
            row = next(rd)
            if first:
                row.append('filename')
                wr.writerow(row)
                first = False
            for row in rd:
                row.append(f)
                wr.writerow(row)
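The csv module can also handle the column selection from the question if you switch to DictReader/DictWriter. A sketch, assuming the fields list and the 'lang' == 'en' filter from the question; columns outside fields are dropped via extrasaction='ignore':

import glob
import csv

path = r'path_to_files/'
all_files = glob.glob(path + "/*.csv")

# columns to keep in the merged file (assumed to exist in every input file)
fields = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8']

with open('updated_thirteen_jan.csv', 'w', newline='') as wfd:
    wr = csv.DictWriter(wfd, fieldnames=fields + ['file_name'], extrasaction='ignore')
    wr.writeheader()
    for f in all_files:
        with open(f, newline='') as fd:
            for row in csv.DictReader(fd):
                if row.get('lang') != 'en':  # row filter from the question
                    continue
                row['file_name'] = f         # tag the row with its source file
                wr.writerow(row)             # extra columns are ignored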
Read one file at a time into a pandas dataframe, add the new column to it, and write it out to a new file.
import os
import glob
import pathlib
import pandas as pd

path = 'path_to_files/'
out_file = 'updated_thirteen_jan.csv'

all_files = glob.glob(path + '*.csv')
all_files = sorted([pathlib.Path(i) for i in all_files])

keep_cols = ['list', 'of', 'columns', 'to', 'keep']
skip_row = 2  # number of rows to skip

for fn in all_files:
    temp = pd.read_csv(fn, usecols=keep_cols, skiprows=skip_row)
    temp['filename'] = fn.stem
    temp.to_csv(out_file, mode='a', index=False, header=not os.path.isfile(out_file))
If reading an entire csv into memory is not feasible, use chunksize. Adjust this value to your machine's capacity.
for fn in all_files:
    reader = pd.read_csv(fn, usecols=keep_cols, skiprows=skip_row, chunksize=5000)
    for idx, df in enumerate(reader):
        df['filename'] = fn.stem
        df.to_csv(out_file, mode='a', index=False, header=not os.path.isfile(out_file))
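One caveat with mode='a' plus header=not os.path.isfile(out_file): if out_file is left over from an earlier run, the new rows are appended after the old ones and no header is written. Removing any stale output before the loop avoids that:

import os

# start from a clean output file so the first write includes the header
if os.path.isfile(out_file):
    os.remove(out_file)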