Fastest way to filter a CSV using pandas and create a matrix
Input dictionary:
{'basename_AM1.csv': ['AM1286', 'AM1287', 'AM1288']}
I have large CSV files in the following format, for example basename_AM1.csv:
ID1 ID2 Score
0 AM1287 AM1286 97.55
1 AM1288 AM1286 78.91
2 AM1289 AM1286 95.38
3 AM1290 AM1286 94.83
4 AM1291 AM1286 82.91
Now I need to build a similarity dictionary like the one below for the given input_dict by searching/filtering the CSV file:
{'AM1286': {'AM1286': 0, 'AM1287': 97.55, 'AM1288': 78.91},
'AM1287': {'AM1286': 97.55, 'AM1287': 100.0, 'AM1288': 78.91},
'AM1288': {'AM1286': 78.91, 'AM1287': 78.91, 'AM1288': 100.0}}
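I came up with the following logic, but with 100 samples in input_dict it takes far too long. Can anyone suggest an optimized, faster way to do this?
import os
from collections import defaultdict
import pandas as pd

final_dict = defaultdict(dict)
for key, value in input_dict.items():
    # 'csv_file_path' is kept as written in the original post
    if os.path.exists('csv_file_path'):
        base_name_df = pd.read_csv('csv_file_path')
        base_name_df.columns = "ID1", "ID2", "Score"
        # Nested loop over every ID pair: one full DataFrame scan per pair
        for id1 in range(len(value)):
            for id2 in range(len(value)):
                scan_df = base_name_df[(base_name_df['ID1'] == value[id1]) & (base_name_df['ID2'] == value[id2])]
                if not scan_df.empty:
                    # Keep the highest score if the pair appears more than once
                    scan_df = scan_df.groupby(['ID1', 'ID2'], as_index=False)['Score'].max()
                    final_dict[value[id1]][value[id2]] = scan_df.iloc[0]['Score']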
pandas has a built-in read_csv method.
The documentation is at: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
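For instance, assuming the file is whitespace-separated with a header row (an assumption, since the post only shows the data as a printed table), a minimal sketch of reading it could look like this. The in-memory text is a stand-in for the real file.

```python
import io
import pandas as pd

# Stand-in for the real file; the layout is assumed, not taken from the post.
raw = """ID1 ID2 Score
AM1287 AM1286 97.55
AM1288 AM1286 78.91
AM1289 AM1286 95.38
"""

# sep=r"\s+" splits on any run of whitespace; the first row becomes the header.
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")
print(df.dtypes)
```

With a real file you would pass its path instead of the StringIO object.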
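IIUC, you can use:
import pandas as pd

input_dict = {'basename_AM1.csv': ['AM1286', 'AM1287', 'AM1288']}
for fname, lst in input_dict.items():
    # Raw string so that \s is passed to the regex engine unchanged
    df = pd.read_csv(fname, sep=r'\s+', names=['ID1', 'ID2', 'score'])
    # pivot takes keyword-only arguments in pandas 2.0 and later
    df2 = (df.pivot(index='ID1', columns='ID2', values='score')
             .reindex(index=lst, columns=lst))
    # Mirror the scores across the diagonal, then fill missing pairs with 0
    df2 = df2.combine_first(df2.T).fillna(0)
    # print for example
    print(df2.to_dict())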
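One caveat: pivot raises an error if the same (ID1, ID2) pair occurs more than once, and the groupby(...).max() in the question suggests duplicates may exist. As a sketch under that assumption, pivot_table with aggfunc="max" can be swapped in, with the rows pre-filtered by isin so only the requested IDs are pivoted. The helper name similarity_matrix and the sample data are made up for illustration.

```python
import pandas as pd

def similarity_matrix(df, ids):
    """Build a symmetric ID-by-ID score matrix restricted to `ids` (hypothetical helper)."""
    # Keep only rows where both IDs are in the requested list.
    sub = df[df["ID1"].isin(ids) & df["ID2"].isin(ids)]
    # pivot_table aggregates duplicate pairs (here with max) instead of raising.
    mat = sub.pivot_table(index="ID1", columns="ID2", values="Score", aggfunc="max")
    mat = mat.reindex(index=ids, columns=ids)
    # Mirror scores across the diagonal, then fill pairs that never appear with 0.
    return mat.combine_first(mat.T).fillna(0)

# Made-up sample data, including a duplicated pair (AM1287, AM1286).
sample = pd.DataFrame({
    "ID1": ["AM1287", "AM1288", "AM1289", "AM1287"],
    "ID2": ["AM1286", "AM1286", "AM1286", "AM1286"],
    "Score": [97.55, 78.91, 95.38, 90.00],
})
ids = ["AM1286", "AM1287", "AM1288"]
result = similarity_matrix(sample, ids)
print(result.to_dict())
```

Filtering first keeps the pivoted table small when the CSV holds many more IDs than the list being queried.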
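If you want 100 on the diagonal:
import numpy as np

# Take an explicit copy: np.fill_diagonal writes in place and needs a writable array
a = df2.to_numpy(copy=True)
np.fill_diagonal(a, 100)
df2 = pd.DataFrame(a, index=lst, columns=lst)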
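To see the diagonal step on its own, here is a small self-contained sketch that starts from the matrix values shown in this answer's output and converts the result back to a nested dictionary. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

ids = ["AM1286", "AM1287", "AM1288"]
# Symmetric matrix with zeros on the diagonal, as produced by the pivot step.
scores = pd.DataFrame(
    [[0.0, 97.55, 78.91],
     [97.55, 0.0, 0.0],
     [78.91, 0.0, 0.0]],
    index=ids, columns=ids,
)

values = scores.to_numpy(copy=True)  # explicit copy, so it is writable
np.fill_diagonal(values, 100.0)      # modifies `values` in place
with_diag = pd.DataFrame(values, index=ids, columns=ids)

# orient="index" yields {row_id: {column_id: score}}.
nested = with_diag.to_dict(orient="index")
print(nested["AM1287"])
```

Because the copy is explicit, the original scores frame keeps its zero diagonal.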
Output (from the first snippet, before the diagonal is set to 100):
{'AM1286': {'AM1286': 0.0, 'AM1287': 97.55, 'AM1288': 78.91},
'AM1287': {'AM1286': 97.55, 'AM1287': 0.0, 'AM1288': 0.0},
'AM1288': {'AM1286': 78.91, 'AM1287': 0.0, 'AM1288': 0.0}}