How to read the data for two consecutive blocks at a time from multiple blocks until the end of file?

If you can think of a better one, please update the title!

I have data with the following structure:

chr    pos    A_block    A_val
  2     05       7       A,T,C
  2     11       7       T,C,G
  2     15       7       AT,C,G
  2     21       7       C,A,GT
  2     31       7       T,C,CA
  2     42       9       T,C,G
  2     55       9       C,G,GC
  2     61       9       A,GC,T
  2     05      12       AC,TG,G
  2     11      12       A,TC,TG

Expected output: For learning purposes, I just want to rewrite the output file to be identical to the input file, but using the process I suggest below.

What I want: step 01: read the values for only two consecutive blocks at a time (7 and 9 first) -> step 02: store the data in a dictionary with the block numbers as the primary unique keys -> step 03: return that dictionary to a predefined function for parsing. -> Now read blocks (9 & 12) -> repeat the same process until the end.

Here is what I was thinking:

import req_packages
from collections import defaultdict

''' make a function that takes data from two blocks at a time '''
def parse_two_blocks(someData):
    for key, vals in someData.items():
        do ... something 
        write the obtained output
        clear memory  # to prevent memory buildup


''' Now, read the input file'''
with open('HaploBlock_toy.txt') as HaploBlocks:
    header = HaploBlocks.readline()  
    # only reads the first line as header

    ''' create an empty dict or a defaultdict - whichever is better? '''
    Hap_Dict = defaultdict(list)  # a plain dict would need setdefault()


    ''' for rest of the lines '''
    for line in HaploBlocks:
        values = line.strip('\n').split('\t')

        ''' append the data to the dict, keyed by block number, until the number of unique keys reaches 2 '''
        Block = values[2]
        Hap_Dict[Block].append(values[3])

        # count the number of unique keys - how?
        if keys_count > 2:
            parse_two_blocks(Hap_Dict)

        elif keys_count < 2 or no new keys:  # This one is odd and won't work, I know.
            end the program

So, when the code executes, it will read the data from blocks 7 and 9 until the dictionary is filled and returned to the pre-defined function. After parsing is done, it can keep only the data from the latter block of the previous pair. That way it only needs to read the remaining blocks.

Expected output: My main problem right now is being able to read two blocks at a time. I don't want to add the inner details of how to parse the information inside `parse_two_blocks(someData)` - that's just another question. But, let's try to rewrite the output the same as the input.

Parse the input into a dynamic (generator) list of blocks. Iterate over the pairs. This should all happen lazily, as you evaluate the pairs. That is, none of these lines should read or store the whole csv file at once.

#!/usr/bin/env python3

data = """chr   pos A_block A_val
2   05  7   A,T,C
2   11  7   T,C,G
2   15  7   AT,C,G
2   21  7   C,A,GT
2   31  7   T,C,CA
2   42  9   T,C,G
2   55  9   C,G,GC
2   61  9   A,GC,T
2   05  12  AC,TG,G
2   11  12  A,TC,TG"""

import csv
import io
import itertools
import collections
import operator
from pprint import pprint

def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = itertools.tee(iterable)
    next(b, None)
    return zip(a, b)

def one():
    # read rows as tuples of values
    c = csv.reader(io.StringIO(data), dialect=csv.excel_tab)
    # read header row
    keys = next(c)
    block_index = keys.index('A_block')
    # group rows by block numbers
    blocks = itertools.groupby(c, key=operator.itemgetter(block_index))
    # extract just the row values for each block
    row_values = (tuple(v) for k, v in blocks)
    # rearrange the values by column
    unzipped_values = (zip(*v) for v in row_values)
    # create a dictionary for each block
    dict_blocks = (dict(zip(keys, v)) for v in unzipped_values)
    yield from pairwise(dict_blocks)


def two():
    c = csv.DictReader(io.StringIO(data), dialect=csv.excel_tab)
    blocks = itertools.groupby(c, key=lambda x: x['A_block'])
    yield from pairwise((k, list(v)) for k, v in blocks)


for a, b in one():
    pprint(a)
    pprint(b)
    print()

Output (one):

{'A_block': ('7', '7', '7', '7', '7'),
 'A_val': ('A,T,C', 'T,C,G', 'AT,C,G', 'C,A,GT', 'T,C,CA'),
 'chr': ('2', '2', '2', '2', '2'),
 'pos': ('05', '11', '15', '21', '31')}
{'A_block': ('9', '9', '9'),
 'A_val': ('T,C,G', 'C,G,GC', 'A,GC,T'),
 'chr': ('2', '2', '2'),
 'pos': ('42', '55', '61')}

{'A_block': ('9', '9', '9'),
 'A_val': ('T,C,G', 'C,G,GC', 'A,GC,T'),
 'chr': ('2', '2', '2'),
 'pos': ('42', '55', '61')}
{'A_block': ('12', '12'),
 'A_val': ('AC,TG,G', 'A,TC,TG'),
 'chr': ('2', '2'),
 'pos': ('05', '11')}

io.StringIO(string)

Takes a string and returns a file-like object that contains the contents of the string.
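A quick check:

```python
import io

f = io.StringIO("2\t05\t7\n2\t11\t7\n")
first = f.readline()  # reads one line, just like a real file
print(first)
```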

csv.DictReader(file_object, dialect) from the csv module

Returns a dict for each row (an OrderedDict before Python 3.8), with the field names taken from the very first row used as dictionary keys for the field values.
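For example (from Python 3.8 on, the rows come back as plain dicts, which also preserve insertion order):

```python
import csv
import io

reader = csv.DictReader(io.StringIO("chr\tpos\n2\t05\n"), dialect=csv.excel_tab)
row = next(reader)
print(dict(row))  # {'chr': '2', 'pos': '05'}
```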

groupby(iterable, key_function)

Make an iterator that returns consecutive keys and groups from the iterable. The key is a function computing a key value for each element.
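Note that groupby() only groups *consecutive* equal keys, which is exactly why it fits this file (each block's rows are contiguous); the same key appearing again later starts a new group:

```python
import itertools

blocks = ['7', '7', '9', '9', '7']
runs = [(k, list(g)) for k, g in itertools.groupby(blocks)]
print(runs)  # [('7', ['7', '7']), ('9', ['9', '9']), ('7', ['7'])]
```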

lambda x: x['A_block']

A temporary function that takes an input named x and returns the value for the key 'A_block'

(k, list(v)) for k, v in blocks

groupby() returns an iterator (that can only be used once) for the values. This converts that iterator to a list.
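That one-shot behaviour is easy to demonstrate: once the groupby object advances, the previous group's iterator is exhausted:

```python
import itertools

groups = itertools.groupby('7799')
key, grp = next(groups)   # first group: the run of '7's
next(groups)              # advance to the '9' run...
print(list(grp))          # ...the old group iterator is now empty: []
```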

pairwise(iterable) recipe

"s -> (s0,s1), (s1,s2), (s2, s3), ..."