Compare two defaultdict(list) with logical conditions
I have two defaultdict(list) objects:
ids
3:42259955 [{'chr': '3', 'ref': 'G', 'depth': '224', 'base': 'A', 'count': '1', 'positive_strand': '0', 'negative_strand': '1', 'percent_bias': 0.0, 'vaf': 0.0, 'mutation': 'snv', 'group': '5555', 'timepoint': 'D0', 'st': '42259955'}, {'chr': '3', 'ref': 'G', 'depth': '224', 'base': 'C', 'count': '0', 'positive_strand': '0', 'negative_strand': '0', 'percent_bias': '0', 'vaf': '0', 'mutation': 'snv', 'group': '5555', 'timepoint': 'D0', 'st': '42259955'}, {'chr': '3', 'ref': 'G', 'depth': '224', 'base': 'G', 'count': '223', 'positive_strand': '121', 'negative_strand': '102', 'percent_bias': 0.54, 'vaf': 1.0, 'mutation': 'no-mutation', 'group': '5555', 'timepoint': 'D0', 'st': '42259955'}, {'chr': '3', 'ref': 'G', 'depth': '224', 'base': 'T', 'count': '0', 'positive_strand': '0', 'negative_strand': '0', 'percent_bias': '0', 'vaf': '0', 'mutation': 'snv', 'group': '5555', 'timepoint': 'D0', 'st': '42259955'}, {'chr': '3', 'ref': 'G', 'depth': '224', 'base': 'N', 'count': '0', 'positive_strand': '0', 'negative_strand': '0', 'percent_bias': '0', 'vaf': '0', 'mutation': 'snv', 'group': '5555', 'timepoint': 'D0', 'st': '42259955'}]
V1
3:42259955 [{'group': '5555', 'timepoint': 'D0', 'chrm': '3', 'st': '42259955', 'en': '42259956', 'var': 'C'}, {'group': '5555', 'timepoint': 'C1', 'chrm': '3', 'st': '42259955', 'en': '42259956', 'var': 'C'}, {'group': '5555', 'timepoint': 'C3', 'chrm': '3', 'st': '42259955', 'en': '42259956', 'var': 'C'}, {'group': '5555', 'timepoint': 'C4', 'chrm': '3', 'st': '42259955', 'en': '42259956', 'var': 'C'}]
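What I intend to do:
Compare the two defaultdict(list) objects.
First, check whether the keys match.
Check whether ref and base are the same in ids; if so, store the depth information, which will be constant.
That is this entry: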
{'chr': '3', 'ref': 'G', 'depth': '224', 'base': 'G', 'count' : '223', 'positive_strand': '121', 'negative_strand': '102', 'percent_bias': 0.54, 'vaf': 1.0, 'mutation': 'no-mutation', 'group': '5555', 'timepoint': 'D0', 'st': '42259955'}
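Check whether base in ids == var in V1 (in this case 'C'); if so, take the count from ids (which is 0).
Check the timepoint: if a timepoint is not in ids but is in the variants (V1), take the timepoint from V1 and fill in the other information from ids.
Desired output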
position timepoint chr st depth count base positive_strand negative_strand percent_bias vaf
3:42259955 D0 3 42259955 224 0 C 0 0 0 0
3:42259955 C1 3 42259955 224 0 C 0 0 0 0
3:42259955 C3 3 42259955 224 0 C 0 0 0 0
3:42259955 C4 3 42259955 224 0 C 0 0 0 0
What I have so far
def getValueOf(k, L):
    #print(L)
    print(len(L))
    for i, v in enumerate(d[k] for d in L):
        return i,v

for key in ids.keys() & V1.keys():
    ## first cond compare within each list
    if getValueOf('ref', ids[key]) == getValueOf('base', ids[key]):
        ref_count = getValueOf('count', ids[key])
        ref_depth = getValueOf('depth', ids[key])
    ## second cond compare between two defaultdicts
    if getValueOf('var', V1[key]) == getValueOf('base', ids[key]):
        var_count = getValueOf('count', ids[key])
Is there a more elegant way than this? Should I be using defaultdict in the first place, or would a nested dict work?
Update
V1
3:42259955 [{'group': '555', 'timepoint': 'D0', 'chrm': '3', 'st': '42259955', 'en': '42259956', 'var': 'C'}, {'group': '555', 'timepoint': 'C1', 'chrm': '3', 'st': '42259955', 'en': '42259956', 'var': 'C'}, {'group': '555', 'timepoint': 'C3', 'chrm': '3', 'st': '42259955', 'en': '42259956', 'var': 'C'}, {'group': '555', 'timepoint': 'C4', 'chrm': '3', 'st': '42259955', 'en': '42259956', 'var': 'C'}]
ids
3:42259955 [{'chr': '3', 'ref': 'G', 'depth': '141', 'base': 'A', 'count': '1', 'positive_strand': '0', 'negative_strand': '1', 'percent_bias': 0.0, 'vaf': 0.01, 'mutation': 'snv', 'group': '555', 'timepoint': 'C4', 'st': '42259955'}, {'chr': '3', 'ref': 'G', 'depth': '141', 'base': 'C', 'count': '4', 'positive_strand': '0', 'negative_strand': '4', 'percent_bias': 0.0, 'vaf': 0.03, 'mutation': 'snv', 'group': '555', 'timepoint': 'C4', 'st': '42259955'}, {'chr': '3', 'ref': 'G', 'depth': '141', 'base': 'G', 'count': '135', 'positive_strand': '99', 'negative_strand': '36', 'percent_bias': 0.73, 'vaf': 0.96, 'mutation': 'no-mutation', 'group': '555', 'timepoint': 'C4', 'st': '42259955'}, {'chr': '3', 'ref': 'G', 'depth': '141', 'base': 'T', 'count': '1', 'positive_strand': '0', 'negative_strand': '1', 'percent_bias': 0.0, 'vaf': 0.01, 'mutation': 'snv', 'group': '555', 'timepoint': 'C4', 'st': '42259955'}, {'chr': '3', 'ref': 'G', 'depth': '141', 'base': 'N', 'count': '0', 'positive_strand': '0', 'negative_strand': '0', 'percent_bias': '0', 'vaf': '0', 'mutation': 'snv', 'group': '555', 'timepoint': 'C4', 'st': '42259955'}, {'chr': '3', 'ref': 'G', 'depth': '141', 'base': '+A', 'count': '1', 'positive_strand': '0', 'negative_strand': '1', 'percent_bias': 0.0, 'vaf': 0.01, 'mutation': 'ins', 'group': '555', 'timepoint': 'C4', 'st': '42259955'}, {'chr': '3', 'ref': 'G', 'depth': '141', 'base': '+C', 'count': '13', 'positive_strand': '0', 'negative_strand': '13', 'percent_bias': 0.0, 'vaf': 0.09, 'mutation': 'ins', 'group': '555', 'timepoint': 'C4', 'st': '42259955'}, {'chr': '3', 'ref': 'G', 'depth': '141', 'base': '+T', 'count': '11', 'positive_strand': '0', 'negative_strand': '11', 'percent_bias': 0.0, 'vaf': 0.08, 'mutation': 'ins', 'group': '555', 'timepoint': 'C4', 'st': '42259955'}]
Output from the code
position timepoint chr ref st depth count base positive_strand negative_strand percent_bias vaf
0 3:42259955 D0 3 G 42259955 141 4 C 0 4 0.0 0.03
1 3:42259955 C1 3 G 42259955 141 4 C 0 4 0.0 0.03
2 3:42259955 C3 3 G 42259955 141 4 C 0 4 0.0 0.03
3 3:42259955 C4 3 G 42259955 141 4 C 0 4 0.0 0.03
Desired output
position timepoint chr ref st depth count base positive_strand negative_strand percent_bias vaf
0 3:42259955 D0 3 G 42259955 141 0 C 0 0 0.0 0.00
1 3:42259955 C1 3 G 42259955 141 0 C 0 0 0.0 0.00
2 3:42259955 C3 3 G 42259955 141 0 C 0 0 0.0 0.00
3 3:42259955 C4 3 G 42259955 141 4 C 0 4 0.0 0.03
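OK, so I'm still not sure I've got your requirements down 100%. It's hard to know what odd cases will turn up in a larger dataset, or how inefficient this will become at scale. I think I've solved your problem, though.
Updated to address the new problem:
This should be a working solution. At this point, though, there are so many conditions and wrinkles that I suspect we'd be better off, in both efficiency and simplicity, using pandas to create some tables and run a few join and aggregation queries, rather than learning how to loop through nested dicts with for loops.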
import pandas as pd

# Column order for the final table (module level so it is visible outside comb_dicts)
fields = [
    'position', 'timepoint', 'chr',
    'st', 'depth', 'count', 'base',
    'positive_strand', 'negative_strand',
    'percent_bias', 'vaf'
]

def comb_dicts(ids, v1):
    # Values used when a V1 timepoint has no matching record in ids
    def_cols = {
        'count': 0, 'positive_strand': 0,
        'negative_strand': 0, 'percent_bias': 0.0, 'vaf': 0.0
    }
    # Make a list for our output rows
    rows = []
    # Iterate through shared keys
    for k in ids.keys() & v1.keys():
        # Empty list for our new var dicts
        var_ds = []
        # Loop through the dicts in V1
        for d in v1[k]:
            # Find any matching dicts in the ids list - where the timepoints match
            # Use ** unpacking to create new dicts - don't update because that will alter the originals
            # Note the order of v and d, this ensures that any keys in both use the value from the V1 dict
            # This is important later
            var_ds = [
                {**v, **d, 'position': k} for v in ids[k]
                if (
                    v['base'] != v['ref'] and
                    d['var'] == v['base'] and
                    d['timepoint'] == v['timepoint']
                )
            ]
            # If we didn't find any with matching timepoints in ids then look for ones without
            # This is where the order of v and d is important. We will keep the V1 timepoint this way
            # Since this case can result in a list of dicts where some could actually be identical
            # we will need to de-dup it at some point - can do this later with pandas
            # By unpacking def_cols last we can overwrite columns that we don't want copied from ids
            if not var_ds:
                var_ds = [
                    {**v, **d, 'position': k, **def_cols} for v in ids[k]
                    if (
                        v['base'] != v['ref'] and
                        d['var'] == v['base']
                    )
                ]
            # Collect the rows found for this V1 dict before moving on to the next one
            rows.extend(var_ds)
    return rows

my_rows = comb_dicts(ids, V1)
df = pd.DataFrame.from_records(my_rows)
df.drop_duplicates(inplace=True)
print(df[fields])
# If you want the de-duped rows as a list of dicts then do
uniq_rows = df.to_dict('records')
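For comparison, here is a rough sketch of the pandas join approach mentioned above. It is not a drop-in replacement for comb_dicts: the function name combine_with_pandas and the STAT_COLS list are made up for this illustration, and it assumes that ids is non-empty, that chr, ref and depth are constant per position (so the first record for each position is used), and that every V1 variant at a shared position should get a row, with zeros when ids has no record for that base and timepoint. It also converts the statistic columns to numbers, which the loop version does not.

```python
import pandas as pd

# Hypothetical helper list: the per-base statistics that default to 0
STAT_COLS = ['count', 'positive_strand', 'negative_strand', 'percent_bias', 'vaf']

def combine_with_pandas(ids, v1):
    # Flatten each dict of lists into one table, keeping the dict key as 'position'
    ids_df = pd.DataFrame(
        [{**rec, 'position': pos} for pos, recs in ids.items() for rec in recs]
    )
    v1_df = pd.DataFrame(
        [{**rec, 'position': pos} for pos, recs in v1.items() for rec in recs]
    )

    # Assumption: chr, ref and depth are constant per position, so take the first record
    const = ids_df.drop_duplicates('position')[['position', 'chr', 'ref', 'depth']]

    # Only non-reference bases can match a variant
    alt = ids_df[ids_df['base'] != ids_df['ref']]
    stats = alt[['position', 'timepoint', 'base'] + STAT_COLS]

    out = (
        v1_df[['position', 'timepoint', 'st', 'var']]
        .rename(columns={'var': 'base'})
        # Inner join keeps only positions present in both inputs
        .merge(const, on='position', how='inner')
        # Left join keeps V1 timepoints that have no matching record in ids
        .merge(stats, on=['position', 'timepoint', 'base'], how='left')
    )
    # Unmatched rows come back as NaN; convert to numbers and fill with 0
    out[STAT_COLS] = out[STAT_COLS].apply(pd.to_numeric).fillna(0)
    return out
```

Calling combine_with_pandas(ids, V1) on the updated data should give one row per V1 timepoint, with the ids statistics only on the C4 row and zeros elsewhere.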