Initializing and Accessing 2D Arrays of strings in Python 3

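I am trying to initialize a 2D matrix of sequences from a .fasta file in Python 3, but I can't figure out how to initialize it so that I can access the individual characters.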

For example, I want to take:

s1 = "CATTAG"
s2 = "GGTCAC"

and then do something like
matrix = [s1][s2]

which would produce

x C A T T A G
G
G
T
C
A
C

where I could access the elements of the matrix individually and change their values, e.g.,

matrix[0][0] = 0

producing:

x C A T T A G
G 0
G
T
C
A
C

Any help is greatly appreciated!

Assuming you are fine with using a library like pandas, this is a simple way to achieve what you want, provided the index has unique values:

from itertools import product  # standard Python library
from pandas import Series, MultiIndex  # you may need to install this package `pandas`

s1 = 'CATG'
s2 = 'GTCA'
idx = MultiIndex.from_tuples(product(s1, s2))  # every (s1 char, s2 char) pair
df = Series(data=None, index=idx)

df.loc[('C', 'G')] = 0  # one .loc call with the full key; chained df['C']['G'] = 0 may only modify a temporary copy

print(df)

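However, since you are providing indices with duplicate values (i.e. two 'A's and two 'T's in 'CATTAG'), you cannot index those cells individually in a simple way.

The code still works, but the results can be very confusing and are probably not what you are looking for.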
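To illustrate the duplicate-label problem, here is a minimal sketch. It uses a plain one-level Series named `s` (a name chosen for this example, not taken from the code above) labeled with the characters of 'CATTAG'. Selecting or assigning by a repeated label touches every matching position, not just one.

```python
import pandas as pd

# Label each position with its character; 'A' and 'T' each occur twice in 'CATTAG'.
s = pd.Series(range(6), index=list("CATTAG"))

repeated_hit = s.loc["A"]  # 'A' occurs twice, so this is a Series with two entries

s.loc["T"] = -1            # the assignment lands on both 'T' positions

print(repeated_hit)
print(s)
```

There is no way to say "the second T" by label alone here, which is why the integer-indexed variants are easier to work with.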

If you are fine with just using integers for indexing, the solution is similar:

from itertools import product  # standard Python library
import pandas as pd  # you may need to install this package `pandas`

idx = pd.MultiIndex.from_tuples(product(range(6), range(6)))
df = pd.Series(data=None, index=idx)

df.loc[(0, 0)] = 0  # one .loc call with the full key instead of chained df[0][0] = 0

print(df)

A different approach (which yields an actual 2D matrix):

import pandas as pd  # you may need to install this package `pandas`

df = pd.DataFrame(index=range(6), columns=pd.Series(range(6)))

df.loc[0, 0] = 0  # row 0, column 0; chained df[0][0] = 0 may not update df in newer pandas

print(df)

You appear to want to store the characters as data as well. That seems like a bad idea, but it can be done with the following alternative:

import pandas as pd  # you may need to install this package `pandas`

s1 = 'CATTAG'
s2 = 'GGTCAC'
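# Rows follow s2 and columns follow s1; one extra row and column hold the labels.
df = pd.DataFrame(index=range(len(s2)+1), columns=pd.Series(range(len(s1)+1)))

df.loc[0] = pd.Series(list('x'+s1))  # first row: the characters of s1
df[0] = pd.Series(list('x'+s2))  # first column: the characters of s2
df.loc[1, 1] = 0  # one .loc call; chained df[1][1] = 0 may not update df in newer pandas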

print(df)

Output:

   0    1    2    3    4    5    6
0  x    C    A    T    T    A    G
1  G    0  NaN  NaN  NaN  NaN  NaN
2  G  NaN  NaN  NaN  NaN  NaN  NaN
3  T  NaN  NaN  NaN  NaN  NaN  NaN
4  C  NaN  NaN  NaN  NaN  NaN  NaN
5  A  NaN  NaN  NaN  NaN  NaN  NaN
6  C  NaN  NaN  NaN  NaN  NaN  NaN
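For comparison, the same grid can be sketched with nothing but nested lists, which may be enough if pandas is not wanted. This is only an illustration: `build_matrix` and its `fill` parameter are hypothetical names, and empty cells are represented by `None` rather than NaN.

```python
def build_matrix(s1, s2, fill=None):
    """Return a grid whose first row is 'x' + s1 and whose first column is 'x' + s2."""
    header = ["x"] + list(s1)
    # One row per character of s2: the character itself, then one empty cell per column.
    rows = [[ch] + [fill] * len(s1) for ch in s2]
    return [header] + rows


matrix = build_matrix("CATTAG", "GGTCAC")
matrix[1][1] = 0  # first data cell: row 'G', column 'C'

for row in matrix:
    print(" ".join("." if cell is None else str(cell) for cell in row))
```

Each row is built as a separate list, so assigning to one cell does not affect the other rows.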

If you want an idea of the performance for larger dataframes:

import timeit
import random
import pandas as pd  # you may need to install this package `pandas`


def define(size):
    global df
    df = pd.DataFrame(index=range(size), columns=pd.Series(range(size)))


def updates(x, size):
    global df
    for _ in range(x):
        df[random.randint(0, size-1)][random.randint(0, size-1)] = random.randint(0, 100)


print('create:', timeit.timeit(lambda: define(10000), number=2))
print('update:', timeit.timeit(lambda: updates(100000, 10000), number=2))

(each test is run twice, with 10,000 x 10,000 elements and 100,000 updates)

Results (in seconds):

create: 15.3850586
update: 20.7603317

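Not lightning fast, but it doesn't take ages either. You can get better performance using numpy, though you may lose some nice-to-have features. The original question doesn't suggest data sizes this large. With numpy: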

import timeit
import random
import numpy as np  # you may need to install this package `numpy`


def define(size):
    global arr
    arr = np.zeros(shape=(size, size))


def updates(x, size):
    global arr
    for _ in range(x):
        arr[random.randint(0, size-1)][random.randint(0, size-1)] = random.randint(0, 100)


print('create:', timeit.timeit(lambda: define(10000), number=2))
print('update:', timeit.timeit(lambda: updates(100000, 10000), number=2))

Faster by an order of magnitude or more:

create: 0.019482200000000005
update: 1.5629257
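Applied to the two sequences from the question, a numpy version might look like the sketch below. It assumes the characters are kept in the original strings and only numbers are stored in the array; `scores` is a name chosen for this example.

```python
import numpy as np

s1 = "CATTAG"
s2 = "GGTCAC"

# Rows follow s2, columns follow s1; the strings themselves stay outside the array.
scores = np.zeros((len(s2), len(s1)), dtype=int)

scores[0, 0] = 1  # cell for s2[0] ('G') against s1[0] ('C')

# The row and column characters for any cell come straight from the strings.
row, col = 0, 0
print(s2[row], s1[col], scores[row, col])
```

Indexing with a single `[row, col]` pair avoids creating an intermediate row object for each access.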