Making an integer (and tuple) list from a string list in Python

Question

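I am trying to make a list of ints (and tuples) from a list of strings.
Let me explain what I intend to do and what makes it hard for me.

My coding plan

A. My function (myFunc) takes a list of strings as its argument.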

   >>> STRINGS = ['GAT','GAC','ATCG','ATA','GTA']  
   >>> myFunc(STRINGS)

B. Then myFunc arranges all the characters in a 'special' way and returns a new list of characters (RESULT).

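1) 'GAT' - the first string

All characters of the first string in the list are selected.
RESULT = ['G','A','T']

2) 'GAC' - the second string

'G' in 'GAC'
The next string is STRINGS[1] (== 'GAC').
Compare STRINGS[1][0] with the first characters of all prior strings (PRIOR).
In this step, PRIOR only includes 'GAT'.
If STRINGS[1][0] == STRINGS[0][0], the 'G' in 'GAC' must not be appended to RESULT.
'A' in 'GAC'
'A' is the second character of 'GAC'.
Check whether PRIOR contains a string whose second character is 'A'.
In this step, that string is 'GAT', so 'A' is not appended either.
'C' in 'GAC'
'C' is the third character of 'GAC'.
Check whether PRIOR contains a string whose third character is 'C'.
In this step, PRIOR contains no such string, so 'C' can be appended to RESULT.
RESULT = ['G','A','T','C']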

3) Repeat this process for all remaining strings in STRINGS.

RESULT = ['G','A','T','C','A','T','C','G','A','T','A']
In this list, I can number all the characters.
NUMBERS = [1,2,3,4,5,6,7,8,9,10,11]
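Step B can be read as inserting each string into a trie and recording a character only when a new node has to be created. The following is a minimal sketch under that reading, not the asker's code: the helper name `number_characters` and the `children` dictionary keyed by `(parent_node, character)` are assumptions made for illustration, and a character is treated as "already seen" only when the whole prefix leading up to it appeared in an earlier string.

```python
def number_characters(strings):
    """Return (result, numbers) for a list of strings.

    Hypothetical helper: `children` maps (parent_node, char) -> child_node,
    with node 0 acting as the root of the trie.
    """
    children = {}
    result = []
    numbers = []
    for s in strings:
        node = 0  # every string starts at the root
        for ch in s:
            key = (node, ch)
            if key not in children:
                # This prefix has not been seen before: create a new node
                # and record the character that leads to it.
                new_node = len(result) + 1
                children[key] = new_node
                result.append(ch)
                numbers.append(new_node)
            node = children[key]
    return result, numbers


result, numbers = number_characters(['GAT', 'GAC', 'ATCG', 'ATA', 'GTA'])
print(result)   # ['G', 'A', 'T', 'C', 'A', 'T', 'C', 'G', 'A', 'T', 'A']
print(numbers)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
```

Because each string is walked from the root one character at a time, strings of different lengths need no special handling in this sketch.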

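C. Convert NUMBERS and RESULT into a higher-level data structure.

I obtained RESULT and NUMBERS in the previous steps.
In this step, these lists should be converted into a higher-level data structure.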

RESULT = ['G','A','T','C','A','T','C','G','A','T','A']  
NUMBERS = [1,2,3,4,5,6,7,8,9,10,11]  
[(0,1), (1,2), (2,3), (2,4), ... ] or {(0,1), (1,2), (2,3), (2,4), ... }  
{(0,1):'G', (1,2):'A', (2,3):'T', (2,4):'C', ...}
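The edge dictionary shown above cannot be rebuilt from RESULT and NUMBERS alone, because those two lists do not say which node is the parent of which. One possible way to get it, sketched here with a hypothetical helper `trie_edges` (not part of the original plan), is to record each edge at the moment its node is created, using node 0 as the root.

```python
def trie_edges(strings):
    """Map each trie edge (parent, child) to its character; the root is node 0."""
    children = {}  # (parent_node, char) -> child_node
    edges = {}     # (parent_node, child_node) -> char
    for s in strings:
        node = 0
        for ch in s:
            if (node, ch) not in children:
                child = len(children) + 1  # next unused node number
                children[(node, ch)] = child
                edges[(node, child)] = ch
            node = children[(node, ch)]
    return edges


edges = trie_edges(['GAT', 'GAC', 'ATCG', 'ATA', 'GTA'])
print(list(edges.items())[:4])
# [((0, 1), 'G'), ((1, 2), 'A'), ((2, 3), 'T'), ((2, 4), 'C')]
```

`list(edges)` gives the tuple list and `set(edges)` gives the set form; the order shown relies on dictionaries preserving insertion order, which Python guarantees from version 3.7.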

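What I find hard to implement.

The plan becomes difficult when the strings vary in length.
Comparing characters with the characters of the prior strings is not easy.
Converting the int list into tuples, a Trie, a Graph...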

# SUMMARY  
# Sorry, this is not a code.
# This shows how a string list is transformed to int (and tuple) list.

# 'GAT'  ->  'G,A,T'  ->  1,2,3   ->  1,2,3  ->  (0,1),(1,2),(2,3)  
# 'GAC'  ->  '-,-,C'  ->  -,-,4   ->  1,2,4  ->  (0,1),(1,2),(2,4)  
# 'ATCG' -> 'A,T,C,G' -> 5,6,7,8  -> 5,6,7,8 ->  (0,5),(5,6),(6,7),(7,8)  
# 'ATA'  ->  '-,-,A'  ->  -,-,9   ->  5,6,9  ->  (0,5),(5,6),(6,9)  
# 'GTA'  ->  '-,T,A'  ->  -,10,11 -> 1,10,11 ->  (0,1),(1,10),(10,11)  

# ['GAT','GAC','ATCG','ATA','GTA']
# -> ['GAT','C','ATCG','A','TA']
# -> ['G','A','T','C','A','T','C','G','A','T','A']
# -> [1,2,3,4,5,6,7,8,9,10,11]
# -> tuple list
# -> change tuple list to ordered set
# -> apply this to Python graph and Trie structures.

I would like to apply this to Graph and Trie structures in Python. Any hints or suggestions would be appreciated. Thanks.

Update 2015.04.15
I wrote some code to get an int list from a list of strings.

def diff_idx(str1, str2):
    """
    Returns the length of the common prefix of both strings,
    i.e. the first index at which their characters differ.
    >>> diff_idx('GAT','GAC')
        2
    """
    limit = min(len(str1), len(str2))
    for i in range(limit):
        if str1[i] != str2[i]:
            return i
    # No mismatch found (also covers an empty string, where the loop never runs).
    return limit

def diff_idxl(xs, x):
    """
    >>> diff_idxl(['GAT','GAC','ATCG','ATA'],'GTA')
        1
    """
    return max([diff_idx(s,x) for s in xs])

def num_seq(patterns):
    """
    >>> num_seq(['GAT','GAC','ATCG','ATA','GTA'])
        ['G', 'A', 'T', 'C', 'A', 'T', 'C', 'G', 'A', 'T', 'A']
    """
    lst = patterns[:]
    answer = [c for c in lst[0]]
    comp = [lst[0]]
    for i in range(1, len(patterns)):
        answer.extend(patterns[i][diff_idxl(comp,patterns[i]):])
        comp.append(patterns[i])
    return answer

I get the correct result with this code.

>>> num_seq(['GAT','GAC','ATCG','ATA','GTA'])
    ['G', 'A', 'T', 'C', 'A', 'T', 'C', 'G', 'A', 'T', 'A']
>>> # (index + 1) means a node in Trie structure.

Update 2015.04.17
I wrote some additional code to get what I want.

>>> # What I want to get is this... 
>>> strings = ['GAT','GACA','ATC','GATG']
>>> nseq = num_seq(strings)
    ['G','A','T','C','A','A','T','C','G']
>>> make_matrix_trie(strings)
    [[1, 2, 3], [0, 0, 4, 5], [6, 7, 8], [0, 0, 0, 9]]

My implementation of make_matrix_trie is as follows.

def make_matrix_trie(patterns):
    m = []
    for pat in patterns:
        m.append([0]*len(pat))

    comp = num_seq(patterns)
    comp.append(0)

    idx = 1
    for i in range(len(patterns)):
        for j in range(len(patterns[i])):
            if patterns[i][j] == comp[0]:
                m[i][j] = idx
                idx += 1
                comp.pop(0)
            else:
                m[i][j] = 0
            print (m,comp)
    return m

But the result is not what I expected.

>>> make_matrix_trie(['GAT','GACA','ATC','GATG'])
    [[1, 2, 3], [0, 0, 4, 5], [6, 7, 8], [9, 0, 0, 0]]
>>> # expected result:
>>> # [[1, 2, 3], [0, 0, 4, 5], [6, 7, 8], [0, 0, 0, 9]]

With some help, I think I can correct and finish my code.
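A likely cause of the mismatch: make_matrix_trie compares every character with the head of comp without checking whether that position belongs to an already shared prefix, so the leading 'G' of 'GATG' consumes the final 'G' in comp. The sketch below is one possible way around this, not a patch to the code above. The function name `make_matrix_trie_sketch` and the `children` dictionary are assumptions for illustration; a position gets a new number only when its whole prefix has not appeared before.

```python
def make_matrix_trie_sketch(patterns):
    """Hypothetical variant of make_matrix_trie.

    New trie nodes get the next number; positions that reuse an
    existing prefix get 0.
    """
    children = {}  # (parent_node, char) -> child_node, root is node 0
    matrix = []
    for pat in patterns:
        row = []
        node = 0
        for ch in pat:
            key = (node, ch)
            if key in children:
                row.append(0)  # prefix already present in the trie
            else:
                children[key] = len(children) + 1
                row.append(children[key])
            node = children[key]
        matrix.append(row)
    return matrix


print(make_matrix_trie_sketch(['GAT', 'GACA', 'ATC', 'GATG']))
# [[1, 2, 3], [0, 0, 4, 5], [6, 7, 8], [0, 0, 0, 9]]
```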

Answer 1

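I haven't figured out your masking and integer-assignment scheme. Does this have something to do with nucleotides? Elaborating would help.

I can help with the last step, though. Here is a one-liner that converts your integer list into "tuple lists."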

def listToTupleList(l):
    return [(l[i-1],l[i]) if i!=0 else (0,l[i]) for i in range(len(l))]
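As a usage illustration, the one-liner appears to fit best when applied to the node path of a single string (for example 1, 2, 4 for 'GAC' in the question's summary) rather than to the flat NUMBERS list. The `zip`-based variant below is a hypothetical equivalent added here for comparison; it is not part of the answer.

```python
def listToTupleList(l):
    # Same one-liner as in the answer, repeated so this example runs on its own.
    return [(l[i-1], l[i]) if i != 0 else (0, l[i]) for i in range(len(l))]


def list_to_tuple_list_zip(path):
    # Hypothetical equivalent: pair each node with its predecessor,
    # starting from the root node 0.
    return list(zip([0] + path[:-1], path))


print(listToTupleList([1, 2, 4]))            # [(0, 1), (1, 2), (2, 4)]
print(list_to_tuple_list_zip([1, 10, 11]))   # [(0, 1), (1, 10), (10, 11)]
```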
