Nested dictionary to CSV conversion optimization
I have a dictionary like this:
no_empty_keys = {'783': [['4gsx', 'ADTQGS', 0.3333333333333333, {'A': ['A224', 'T226'], 'B': ['A224', 'T226']}, 504, 509], ['4gt0', 'ADTQGS', 0.3333333333333333, {'A': ['A224', 'T226'], 'B': ['A224', 'T226']}, 504, 509]],'1062': [['4gsx', 'AELTGY', 0.5, {'A': ['L175', 'T176', 'Y178'], 'B': ['L175', 'T176', 'Y178']}, 453, 458], ['4gt0', 'AELTGY', 0.5, {'A': ['L175', 'T176', 'Y178'], 'B': ['L175', 'T176', 'Y178']}, 453, 458]]}
The code I use to convert it to CSV is:
import pandas as pd

epitope_df = pd.DataFrame(columns=['Epitope ID', 'PDB', 'Percent Identity', 'Epitope Mapped', 'Epitope Sequence', 'Starting Position', 'Ending Position'])
for x in no_empty_keys:
    for y in no_empty_keys[x]:
        # Each append builds a new DataFrame (DataFrame.append was removed in pandas 2.0)
        epitope_df = epitope_df.append({'Epitope ID': x, 'PDB': y[0], 'Percent Identity': y[2], 'Epitope Mapped': y[3], 'Epitope Sequence': y[1], 'Starting Position': y[4], 'Ending Position': y[5]}, ignore_index=True)
epitope_df.to_csv('test.csv', index=False)
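My output is a CSV file like this:
It works, but it is not well optimized. When I run it on a dictionary with more than 10,000 entries, the process is very slow. Any ideas on how to speed this up? Thank you for your time.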
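I would start by getting rid of pandas.append. Appending rows to a DataFrame one at a time is inefficient, because every call returns a new DataFrame. You can create the DataFrame in one go: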
# Collect plain dicts first, then build the DataFrame once
result = []
for x in no_empty_keys:
    for y in no_empty_keys[x]:
        result.append(
            {
                'Epitope ID': x,
                'PDB': y[0],
                'Percent Identity': y[2],
                'Epitope Mapped': y[3],
                'Epitope Sequence': y[1],
                'Starting Position': y[4],
                'Ending Position': y[5]
            }
        )
# from_records is a constructor, so call it on the class rather than on an existing frame
epitope_df = pd.DataFrame.from_records(result)
epitope_df.to_csv('new.csv', index=False)
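If the DataFrame is only a stepping stone to the file, the rows can also be written directly with the standard library. The following is a minimal sketch of that idea, not part of the original answer: the helper names `iter_rows` and `write_csv` and the shortened `sample` dictionary are made up for illustration, and it assumes every entry is a six-element list laid out as in the question.

```python
import csv
import io

COLUMNS = [
    "Epitope ID", "PDB", "Percent Identity", "Epitope Mapped",
    "Epitope Sequence", "Starting Position", "Ending Position",
]

def iter_rows(nested):
    """Yield one flat row per inner entry, reordered to match COLUMNS."""
    for epitope_id, entries in nested.items():
        for pdb, sequence, identity, mapped, start, end in entries:
            yield [epitope_id, pdb, identity, mapped, sequence, start, end]

def write_csv(nested, out):
    """Write the header and all rows to an open text file-like object."""
    writer = csv.writer(out)
    writer.writerow(COLUMNS)
    writer.writerows(iter_rows(nested))

# Shortened, hypothetical sample with the same layout as no_empty_keys
sample = {
    "783": [["4gsx", "ADTQGS", 0.5, {"A": ["A224"]}, 504, 509]],
    "1062": [["4gt0", "AELTGY", 0.25, {"B": ["L175"]}, 453, 458]],
}

buffer = io.StringIO()
write_csv(sample, buffer)
csv_text = buffer.getvalue()
```

To write a real file, open it with `newline=""` as the csv module recommends, for example `with open("new.csv", "w", newline="") as f: write_csv(no_empty_keys, f)`. The nested 'Epitope Mapped' dictionary ends up in the file as its string representation.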
You can write ad hoc code by hand or use the convtools library, which generates this kind of converter for you:
from convtools import conversion as c
from convtools.contrib.tables import Table
no_empty_keys = {
    "783": [
        ["4gsx", "ADTQGS", 0.3333333333333333, {"A": ["A224", "T226"], "B": ["A224", "T226"]}, 504, 509],
        ["4gt0", "ADTQGS", 0.3333333333333333, {"A": ["A224", "T226"], "B": ["A224", "T226"]}, 504, 509],
    ],
    "1062": [
        ["4gsx", "AELTGY", 0.5, {"A": ["L175", "T176", "Y178"], "B": ["L175", "T176", "Y178"]}, 453, 458],
        ["4gt0", "AELTGY", 0.5, {"A": ["L175", "T176", "Y178"], "B": ["L175", "T176", "Y178"]}, 453, 458],
    ],
}
columns = (
    "Epitope ID",
    "PDB",
    "Percent Identity",
    "Epitope Mapped",
    "Epitope Sequence",
    "Starting Position",
    "Ending Position",
)
# this is just a function, so it can be run on startup once and stored for
# further reuse
converter = (
    c.iter(
        c.zip(
            c.repeat(c.item(0)),
            c.item(1)
        ).iter(
            (c.item(0),) + tuple(c.item(1, i) for i in range(len(columns) - 1))
        )
    )
    .flatten()
    .gen_converter()
)
# here is the stuff to profile
Table.from_rows(
    converter(no_empty_keys.items()),
    header=columns,
).into_csv("out.csv")
If you are curious about the code convtools generates under the hood, consider installing black and passing debug=True to gen_converter.
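To compare approaches on input of the size mentioned in the question, a small timing harness can help. This is only a sketch under stated assumptions: `make_nested`, `flatten`, and `time_call` are hypothetical helpers, the synthetic entries merely imitate the layout of the sample data, and `flatten` is a stand-in for whichever conversion you actually want to measure.

```python
import random
import time

def make_nested(n_keys, entries_per_key=2, seed=0):
    """Build a synthetic dictionary shaped like no_empty_keys (made-up values)."""
    rng = random.Random(seed)
    return {
        str(key): [
            [f"pdb{j}", "ADTQGS", rng.random(), {"A": ["A224", "T226"]}, 504, 509]
            for j in range(entries_per_key)
        ]
        for key in range(n_keys)
    }

def flatten(nested):
    """Stand-in conversion: one tuple per inner entry, key first."""
    return [
        (key, e[0], e[2], e[3], e[1], e[4], e[5])
        for key, entries in nested.items()
        for e in entries
    ]

def time_call(func, *args, repeat=3):
    """Return the best wall-clock time in seconds over `repeat` runs."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best

data = make_nested(10_000)
elapsed = time_call(flatten, data)
print(f"flattened {len(flatten(data))} rows, best of 3: {elapsed:.4f} s")
```

Swapping `flatten` for the row-by-row append loop, the from_records version, or the convtools converter gives a like-for-like comparison on the same data.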