Convert a csv.DictReader object to non-iterator data and merge values by key
In my data:
myData='''pos\tidx1\tval1\tidx2\tval2
11\t4\tC\t6\tA
15\t4\tA\t6\tT
23\t4\tT\t6\tT
28\t4\tA\t3\tG
34\t4\tG\t3\tC
41\t4\tC\t4\tT
51\t4\tC\t4\tC'''
I read this data with csv.DictReader, using the header row as keys.
import csv
import io
import itertools
input_file = csv.DictReader(io.StringIO(myData), delimiter='\t')
# which produces an iterator
# Now I want to group these rows by idx2, where each idx2 value is the main key
# and the other values that share a key are merged into lists.
# The groupby method gives me
file_blocks = itertools.groupby(input_file, key=lambda x: x['idx2'])
# I can print this as
for index, blocks in file_blocks:
    print(index, list(blocks))
6 [{'val2': 'A', 'val1': 'C', 'idx1': '4', 'pos': '11', 'idx2': '6'}, {'val2': 'T', 'val1': 'A', 'idx1': '4', 'pos': '15', 'idx2': '6'}, {'val2': 'T', 'val1': 'T', 'idx1': '4', 'pos': '23', 'idx2': '6'}]
3 [{'val2': 'G', 'val1': 'A', 'idx1': '4', 'pos': '28', 'idx2': '3'}, {'val2': 'C', 'val1': 'G', 'idx1': '4', 'pos': '34', 'idx2': '3'}]
4 [{'val2': 'T', 'val1': 'C', 'idx1': '4', 'pos': '41', 'idx2': '4'}, {'val2': 'C', 'val1': 'C', 'idx1': '4', 'pos': '51', 'idx2': '4'}]
But since the iterator is exhausted after one pass, I can't print it or use it more than once while debugging.
So,
Question #1: how do I convert it to non-iterator data?
Question #2: how can I process this groupby object further to merge the values that share a key within the same group/block into lists?
Something like an OrderedDict or defaultdict, where the order in which the data was read is preserved:
{'6': defaultdict(<class 'list'>, {'pos': [11, 15, 23], 'idx1': [4, 4, 4], 'val1': ['C', 'A', 'T'], 'idx2': [6, 6, 6], 'val2': ['A', 'T', 'T']})}
{'3': .....
{'4': .....
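Some fixes I tried:
I would rather prepare a key: [values] mapping by unique key before grouping:
update_dict = {}
for lines in input_file:
    print(type(lines))
    for k, v in lines:
        update_dict['idx2'] = lines[k,v]
Another thing I tried was to see whether I could merge the data inside the grouped object:
new_groupBy = {}
for index, blocks in file_blocks:
    print(index, list(blocks))
    for x in blocks:
        for k, v in x:
            ...  # do something to new_groupBy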
So, for your first question, you can simply materialize a list:
In [9]: raw_data='''pos\tidx1\tval1\tidx2\tval2
...: 11\t4\tC\t6\tA
...: 15\t4\tA\t6\tT
...: 23\t4\tT\t6\tT
...: 28\t4\tA\t3\tG
...: 34\t4\tG\t3\tC
...: 41\t4\tC\t4\tT
...: 51\t4\tC\t4\tC'''
In [10]: data_stream = csv.DictReader(io.StringIO(raw_data), delimiter="\t")
In [11]: grouped = itertools.groupby(data_stream, key=lambda x:x['idx2'])
In [12]: data = [(k,list(g)) for k,g in grouped] # order is important, so use a list
In [13]: data
Out[13]:
[('6',
[{'idx1': '4', 'idx2': '6', 'pos': '11', 'val1': 'C', 'val2': 'A'},
{'idx1': '4', 'idx2': '6', 'pos': '15', 'val1': 'A', 'val2': 'T'},
{'idx1': '4', 'idx2': '6', 'pos': '23', 'val1': 'T', 'val2': 'T'}]),
('3',
[{'idx1': '4', 'idx2': '3', 'pos': '28', 'val1': 'A', 'val2': 'G'},
{'idx1': '4', 'idx2': '3', 'pos': '34', 'val1': 'G', 'val2': 'C'}]),
('4',
[{'idx1': '4', 'idx2': '4', 'pos': '41', 'val1': 'C', 'val2': 'T'},
{'idx1': '4', 'idx2': '4', 'pos': '51', 'val1': 'C', 'val2': 'C'}])]
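To illustrate why the inner list(g) matters, here is a minimal sketch. The helper read_groups, the shortened raw_data sample, and the variable names are my own, not part of the question. It shows that a groupby object can only be traversed once, and that calling list() on the outer iterator alone does not preserve the rows of earlier groups, because each group shares the underlying iterator with groupby.

```python
import csv
import io
import itertools

# Shortened sample in the same layout as the question's data.
raw_data = "pos\tidx1\tval1\tidx2\tval2\n11\t4\tC\t6\tA\n15\t4\tA\t6\tT\n28\t4\tA\t3\tG\n"

def read_groups(text):
    """Return a fresh groupby iterator over the rows of `text`, keyed by idx2."""
    rows = csv.DictReader(io.StringIO(text), delimiter="\t")
    return itertools.groupby(rows, key=lambda row: row["idx2"])

# 1) A groupby object is a one-shot iterator: a second pass sees nothing.
lazy = read_groups(raw_data)
first_pass = [key for key, _ in lazy]
second_pass = [key for key, _ in lazy]
print(first_pass, second_pass)

# 2) Materializing each group while the outer iterator advances keeps the rows.
data = [(key, list(group)) for key, group in read_groups(raw_data)]
print(len(data[0][1]))  # number of rows kept for the first group

# 3) Materializing only the outer iterator first is not enough: by the time
#    the first group is read, groupby has already moved past its rows.
broken = [(key, list(group)) for key, group in list(read_groups(raw_data))]
print(broken[0])
```

Once data is a list of (key, list) pairs, it can be printed and iterated as often as needed while debugging.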
As for your second question, try something like this:
In [15]: import collections
In [16]: def accumulate(data):
    ...:     acc = collections.OrderedDict()
    ...:     for d in data:
    ...:         for k,v in d.items():
    ...:             acc.setdefault(k,[]).append(v)
    ...:     return acc
    ...:
In [17]: grouped_data = {k:accumulate(d) for k,d in data}
In [18]: grouped_data
Out[18]:
{'3': OrderedDict([('pos', ['28', '34']),
('idx2', ['3', '3']),
('val2', ['G', 'C']),
('val1', ['A', 'G']),
('idx1', ['4', '4'])]),
'4': OrderedDict([('pos', ['41', '51']),
('idx2', ['4', '4']),
('val2', ['T', 'C']),
('val1', ['C', 'C']),
('idx1', ['4', '4'])]),
'6': OrderedDict([('pos', ['11', '15', '23']),
('idx2', ['6', '6', '6']),
('val2', ['A', 'T', 'T']),
('val1', ['C', 'A', 'T']),
('idx1', ['4', '4', '4'])])}
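The output above lists the outer keys as '3', '4', '6' rather than in reading order, which is what a plain dict could do on older interpreters. As a side note, on Python 3.7 and later a plain dict is guaranteed to keep insertion order, so the same comprehension keeps the groups in first-seen order. A minimal sketch, assuming Python 3.7+ and using a shortened, hand-written stand-in for the data list:

```python
# Stand-in for the materialized `data` list, shortened to two columns.
data = [
    ("6", [{"pos": "11", "val1": "C"}, {"pos": "15", "val1": "A"}]),
    ("3", [{"pos": "28", "val1": "A"}]),
    ("4", [{"pos": "41", "val1": "C"}]),
]

def accumulate(rows):
    """Merge a sequence of row dicts into one dict of lists."""
    acc = {}  # plain dict: insertion-ordered on Python 3.7+
    for row in rows:
        for key, value in row.items():
            acc.setdefault(key, []).append(value)
    return acc

grouped_data = {key: accumulate(rows) for key, rows in data}
print(list(grouped_data))   # ['6', '3', '4']
print(grouped_data["6"])    # {'pos': ['11', '15'], 'val1': ['C', 'A']}
```

If the code also has to run on older interpreters, collections.OrderedDict remains the safe choice.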
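Note that I used list (and dict) comprehensions. They work similarly. The list comprehension is equivalent to:
data = []
for k, g in grouped:
    data.append((k, list(g)))
For good measure, since order seems to matter in any case, here is the equivalent of the dict comprehension using an OrderedDict:
In [20]: grouped_data = collections.OrderedDict()
In [21]: for k, d in data:
    ...:     grouped_data[k] = accumulate(d)
    ...: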
In [22]: grouped_data
Out[22]:
OrderedDict([('6',
OrderedDict([('val2', ['A', 'T', 'T']),
('val1', ['C', 'A', 'T']),
('pos', ['11', '15', '23']),
('idx2', ['6', '6', '6']),
('idx1', ['4', '4', '4'])])),
('3',
OrderedDict([('val2', ['G', 'C']),
('val1', ['A', 'G']),
('pos', ['28', '34']),
('idx2', ['3', '3']),
('idx1', ['4', '4'])])),
('4',
OrderedDict([('val2', ['T', 'C']),
('val1', ['C', 'C']),
('pos', ['41', '51']),
('idx2', ['4', '4']),
('idx1', ['4', '4'])]))])
Note that we can do it all in one pass and avoid creating unnecessary data structures:
import itertools, io, csv, collections
data_stream = csv.DictReader(io.StringIO(raw_data), delimiter="\t")
grouped = itertools.groupby(data_stream, key=lambda x:x['idx2'])
def accumulate(data):
    acc = collections.OrderedDict()
    for d in data:
        for k,v in d.items():
            acc.setdefault(k,[]).append(v)
    return acc
grouped_data = collections.OrderedDict()
for k, g in grouped:
    grouped_data[k] = accumulate(g)
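The desired output in the question shows pos, idx1 and idx2 as integers, whereas csv.DictReader always yields strings. The following is a sketch of one way to convert while accumulating. The set INT_COLUMNS and the function accumulate_typed are hypothetical names of mine, and the sketch assumes those three columns always contain valid integers.

```python
import collections
import csv
import io
import itertools

# Shortened sample in the same layout as the question's data.
raw_data = "pos\tidx1\tval1\tidx2\tval2\n11\t4\tC\t6\tA\n15\t4\tA\t6\tT\n28\t4\tA\t3\tG\n"

# Assumption: these columns always hold integers; int() raises ValueError otherwise.
INT_COLUMNS = {"pos", "idx1", "idx2"}

def accumulate_typed(rows):
    """Merge row dicts into an OrderedDict of lists, converting integer columns."""
    acc = collections.OrderedDict()
    for row in rows:
        for key, value in row.items():
            if key in INT_COLUMNS:
                value = int(value)
            acc.setdefault(key, []).append(value)
    return acc

rows = csv.DictReader(io.StringIO(raw_data), delimiter="\t")
# Each group is consumed by accumulate_typed before groupby advances,
# so no intermediate list of groups is needed.
grouped_data = collections.OrderedDict(
    (key, accumulate_typed(group))
    for key, group in itertools.groupby(rows, key=lambda row: row["idx2"])
)
print(grouped_data["6"]["pos"])   # [11, 15]
print(grouped_data["6"]["val1"])  # ['C', 'A']
```

The outer keys stay strings ('6', '3'), matching the desired output in the question.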
Given
import io
import csv
import itertools as it
import collections as ct
data="""pos\tidx1\tval1\tidx2\tval2
11\t4\tC\t6\tA
15\t4\tA\t6\tT
23\t4\tT\t6\tT
28\t4\tA\t3\tG
34\t4\tG\t3\tC
41\t4\tC\t4\tT
51\t4\tC\t4\tC"""
Part one
how do I convert it into non iter-type data
Code
Here is how to keep the data from the iterator: simply convert it to a list:
>>> input_file = list(csv.DictReader(io.StringIO(data), delimiter = "\t"))
>>> input_file
[{'idx1': '4', 'idx2': '6', 'pos': '11', 'val1': 'C', 'val2': 'A'},
{'idx1': '4', 'idx2': '6', 'pos': '15', 'val1': 'A', 'val2': 'T'},
{'idx1': '4', 'idx2': '6', 'pos': '23', 'val1': 'T', 'val2': 'T'},
{'idx1': '4', 'idx2': '3', 'pos': '28', 'val1': 'A', 'val2': 'G'},
{'idx1': '4', 'idx2': '3', 'pos': '34', 'val1': 'G', 'val2': 'C'},
{'idx1': '4', 'idx2': '4', 'pos': '41', 'val1': 'C', 'val2': 'T'},
{'idx1': '4', 'idx2': '4', 'pos': '51', 'val1': 'C', 'val2': 'C'}]
Or, for the grouped data, use a list comprehension:
>>> file_blocks = [(k, list(g)) for k, g in it.groupby(input_file, key=lambda x: x["idx2"])]
>>> file_blocks
[('6',
[{'idx1': '4', 'idx2': '6', 'pos': '11', 'val1': 'C', 'val2': 'A'},
{'idx1': '4', 'idx2': '6', 'pos': '15', 'val1': 'A', 'val2': 'T'},
{'idx1': '4', 'idx2': '6', 'pos': '23', 'val1': 'T', 'val2': 'T'}]),
('3',
[{'idx1': '4', 'idx2': '3', 'pos': '28', 'val1': 'A', 'val2': 'G'},
{'idx1': '4', 'idx2': '3', 'pos': '34', 'val1': 'G', 'val2': 'C'}]),
('4',
[{'idx1': '4', 'idx2': '4', 'pos': '41', 'val1': 'C', 'val2': 'T'},
{'idx1': '4', 'idx2': '4', 'pos': '51', 'val1': 'C', 'val2': 'C'}])]
Now you can reuse the data in input_file and file_blocks.
Part two
how can I process this groupby object further to merge the values to a list that have common keys within same group/blocks...
Something like orderedDict, defaultDict where the order of the way the data is read is preserved
def collate_data(data):
    """Yield an OrderedDict of merged dictionaries from `data`."""
    for idx, item in data:
        results = ct.OrderedDict()
        dd = ct.defaultdict(list)
        for dict_ in item:
            for k, v in dict_.items():
                dd[k].append(v)
        results[idx] = dd
        yield results
list(collate_data(file_blocks))
Output
[OrderedDict([('6',
defaultdict(list,
{'idx1': ['4', '4', '4'],
'idx2': ['6', '6', '6'],
'pos': ['11', '15', '23'],
'val1': ['C', 'A', 'T'],
'val2': ['A', 'T', 'T']}))]),
OrderedDict([('3',
defaultdict(list,
{'idx1': ['4', '4'],
'idx2': ['3', '3'],
'pos': ['28', '34'],
'val1': ['A', 'G'],
'val2': ['G', 'C']}))]),
OrderedDict([('4',
defaultdict(list,
{'idx1': ['4', '4'],
'idx2': ['4', '4'],
'pos': ['41', '51'],
'val1': ['C', 'C'],
'val2': ['T', 'C']}))])]
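The order of the itertools.groupby() items is maintained by collections.OrderedDict(). The lists in the collections.defaultdict() objects preserve the order of the values from the file's rows (see the dictionaries in input_file).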
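One caveat: itertools.groupby() only groups consecutive items with equal keys. That works for the sample data because rows with the same idx2 are adjacent. If the same idx2 value could appear in separate runs, a sketch like the one below merges rows directly into a single ordered mapping without groupby. The function name collate_rows and the reordered sample are my own, for illustration only.

```python
import collections
import csv
import io

# Same columns as above, but idx2 == 6 appears in two separate runs.
data = "pos\tidx1\tval1\tidx2\tval2\n11\t4\tC\t6\tA\n28\t4\tA\t3\tG\n15\t4\tA\t6\tT\n"

def collate_rows(rows, key_field):
    """Merge rows sharing `key_field` into one dict of lists per key value."""
    merged = collections.OrderedDict()  # keeps keys in first-seen order
    for row in rows:
        group_key = row[key_field]
        if group_key not in merged:
            merged[group_key] = collections.defaultdict(list)
        for k, v in row.items():
            merged[group_key][k].append(v)
    return merged

result = collate_rows(csv.DictReader(io.StringIO(data), delimiter="\t"), "idx2")
print(list(result))        # ['6', '3']
print(result["6"]["pos"])  # ['11', '15']
```

With groupby, the same input would produce two separate '6' groups instead of one.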