Merge csv files based on file names and suffix in Python
First post here, and still very new to Python. I have a collection of 17,000+ csv files, each with 2 columns. The number of rows and the labels are the same in every file. The files are named in a specific format. For example:
- Species_1_OrderA_1.csv
- Species_1_OrderA_2.csv
- Species_1_OrderA_3.csv
- Species_10_OrderB_1.csv
- Species_10_OrderB_2.csv
Each imported dataframe looks like this:
               TreeID  Species_1_OrderA_2
0       Bu2_1201_1992                   0
1       Bu3_1201_1998                   0
2       Bu4_1201_2000                   0
3       Bu5_1201_2002                   0
4       Bu6_1201_2004                   0
..                ...                 ...
307  Fi141_16101_2004                   0
308  Fi142_16101_2006                   0
309  Fi143_16101_2008                   0
310  Fi144_16101_2010                   0
311  Fi147_16101_2015                   0
I want to join the files that correspond to the same species, based on the first column. In the end, I would obtain the files Species_1_OrderA.csv and Species_10_OrderB.csv. Note that the number of files is not necessarily the same for all species.
This is what I have tried so far.
import os
import glob
import pandas as pd

# Importing csv files from directory
path = '.'
extension = 'csv'
os.chdir(path)
files = glob.glob('*.{}'.format(extension))

# Create a dictionary to loop through each file to read its contents and create a dataframe
file_dict = {}
for file in files:
    key = file
    df = pd.read_csv(file)
    file_dict[key] = df

# Extract the name of each dataframe, convert to a list and extract the relevant
# information (before the 3rd underscore). Compare each of these values to the next and
# if they are the same, append them to a list. This list (in my head, at least) will help
# me merge them using pandas.concat
keys_list = list(file_dict.keys())
group = ''
for line in keys_list:
    type = "_".join(line.split("_")[:3])
    for i in range(len(type) - 1):
        if type[i] == type[i+1]:
            group.append(line[keys_list])
print(group)
However, that last bit does not even work and, at this point, I am not sure this is the best way to approach my problem. Any pointers on how to tackle this would be greatly appreciated.
--- EDIT:
This is the expected output for each species file. Ideally, I would remove the rows that contain zeros in them, but that can easily be done with awk (see the pandas sketch after the sample output).
TreeID,Species_1_OrderA_0,Species_1_OrderA_1,Species_1_OrderA_2
Bu2_1201_1992,0,0,0
Bu3_1201_1998,0,0,0
Bu4_1201_2000,0,0,0
Bu5_1201_2002,0,0,0
Bu6_1201_2004,0,0,0
Bu7_1201_2006,0,0,0
Bu8_1201_2008,0,0,0
Bu9_1201_2010,0,0,0
Bu10_1201_2012,0,0,0
Bu11_1201_2014,0,0,0
Bu14_1201_2016,0,0,0
Bu16_1201_2018,0,0,0
Bu18_3103_1989,0,0,0
Bu22_3103_1999,0,0,0
Bu23_3103_2001,0,0,0
Bu24_3103_2003,0,0,0
...
Fi141_16101_2004,0,0,10
Fi142_16101_2006,0,4,0
Fi143_16101_2008,0,0,0
Fi144_16101_2010,2,0,0
Fi147_16101_2015,0,7,0
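As an aside, a pandas filter along these lines could replace the awk step for dropping the all-zero rows. This is an untested sketch; the input filename Species_1_OrderA.csv and the output filename are assumptions on my part:

import pandas as pd

# Keep only rows where at least one species column is non-zero
df = pd.read_csv("Species_1_OrderA.csv")
nonzero = df[(df.drop(columns="TreeID") != 0).any(axis=1)]
nonzero.to_csv("Species_1_OrderA_nonzero.csv", index=False)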
If your goal is to concatenate all the csvs for each species-order into one consolidated csv, this is one way to do it. I haven't tested it, so there may be a couple of errors. The idea is to first use glob, like you're doing, to build a dictionary of file_paths so that all file_paths with the same species-order are grouped together. Then, for each species-order, read all the data into a single table in memory and write out the consolidated file.
import pandas as pd
import glob

# Create a dictionary keyed by species_order, valued by a list of files
# i.e. file_paths_by_species_order['Species_10_OrderB'] = ['Species_10_OrderB_1.csv', 'Species_10_OrderB_2.csv']
file_paths_by_species_order = {}
for file_path in glob.glob('*.csv'):
    # join the first three underscore-separated parts back into a string,
    # so the key is hashable (a plain list would raise a TypeError here)
    species_order = "_".join(file_path.split("_")[:3])
    if species_order not in file_paths_by_species_order:
        file_paths_by_species_order[species_order] = [file_path]
    else:
        file_paths_by_species_order[species_order].append(file_path)

# For each species_order, concat all files and save the info into a new csv
for species_order, file_paths in file_paths_by_species_order.items():
    df = pd.concat(pd.read_csv(file_path) for file_path in file_paths)
    df.to_csv('consolidated_{}.csv'.format(species_order))
There is certainly room for improvement, such as using collections.defaultdict, and writing one file at a time into the consolidated file rather than reading them all into memory.
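As a minimal sketch of the defaultdict variant (untested, with the same filename assumptions as the code above), the grouping step could shrink to:

import glob
from collections import defaultdict

# defaultdict(list) removes the need for the if/else branch when grouping
file_paths_by_species_order = defaultdict(list)
for file_path in glob.glob('*.csv'):
    species_order = "_".join(file_path.split("_")[:3])
    file_paths_by_species_order[species_order].append(file_path)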
Try it like this:
import os
import pandas as pd

path = "C:/Users/username"
files = [file for file in os.listdir(path) if file.endswith(".csv")]

dfs = dict()
for file in files:
    # everything before the final _ is the species name
    species = file.rsplit("_", maxsplit=1)[0]
    # read the csv to a dataframe
    df = pd.read_csv(os.path.join(path, file))
    # if you don't have a df for a species, create a new key
    if species not in dfs:
        dfs[species] = df
    # else, merge current df to existing df on the TreeID
    else:
        dfs[species] = pd.merge(dfs[species], df, on="TreeID", how="outer")

# write all dfs to their own csv files
# (index=False keeps the row index out of the output, matching the expected layout)
for key in dfs:
    dfs[key].to_csv(f"{key}.csv", index=False)
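For what it's worth, the same column-wise merge can also be written as a single fold over a list of per-species dataframes with functools.reduce. A small sketch with made-up toy data (the values below are hypothetical, not taken from the question):

import pandas as pd
from functools import reduce

# Two toy frames standing in for two files of the same species
df1 = pd.DataFrame({"TreeID": ["Bu2_1201_1992", "Bu3_1201_1998"],
                    "Species_1_OrderA_1": [0, 0]})
df2 = pd.DataFrame({"TreeID": ["Bu2_1201_1992", "Bu3_1201_1998"],
                    "Species_1_OrderA_2": [0, 4]})

# Outer join keeps TreeIDs that are missing from some of the files
merged = reduce(lambda left, right: pd.merge(left, right, on="TreeID", how="outer"),
                [df1, df2])
print(merged)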