Initializing and Accessing 2D Arrays of strings in Python 3
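I'm trying to initialize a 2D matrix of sequences from a .fasta file in Python 3, but I can't figure out how to initialize it so that I can access the individual characters.
For example, I want to take: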
s1 = "CATTAG"
s2 = "GGTCAC"
and then do something like
matrix = [s1][s2]
which would form
x C A T T A G
G
G
T
C
A
C
where I could access the individual elements of the matrix and change their values, e.g.,
matrix[0][0] = 0
creating...
x C A T T A G
G 0
G
T
C
A
C
Any help is greatly appreciated!
Assuming you can use a library like pandas, this is a simple way to achieve your goal if the indices have unique values:
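from itertools import product  # standard Python library
from pandas import Series, MultiIndex  # you may need to install this package `pandas`
s1 = 'CATG'
s2 = 'GTCA'
# one index entry for every (character of s1, character of s2) pair
idx = MultiIndex.from_tuples(product(s1, s2))
df = Series(data=None, index=idx)
# a single .loc assignment avoids chained indexing, which may write to a temporary copy
df.loc[('C', 'G')] = 0
print(df)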
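However, since you provided indices with duplicate values (i.e. the two 'A's and the two 'T's in 'CATTAG'), you cannot index those cells individually in a simple way.
The code still works, but the result can be quite confusing and is probably not what you are looking for.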
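To illustrate the duplicate-label problem, here is a minimal sketch. It assumes pandas is installed, and the variable names (`row`, `c_value`, `a_values`) are made up for this example. It labels one row of cells with the characters of 'CATTAG' and looks up a unique label and a repeated one.

```python
import pandas as pd

# Label six cells with the characters of the example sequence.
s1 = "CATTAG"
row = pd.Series(range(len(s1)), index=list(s1))

# A unique label selects a single value ...
c_value = row["C"]
# ... but a repeated label selects every matching position at once.
a_values = row["A"]

print(c_value)
print(list(a_values))
print(row.index.is_unique)
```

Because 'A' occurs twice, the lookup returns both cells rather than one, which is why character labels are awkward for addressing individual cells.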
If you just want to index with integers, the solution is similar:
from itertools import product  # standard Python library
import pandas as pd  # you may need to install this package `pandas`
# one index entry for every (row, column) pair of a 6 x 6 grid
idx = pd.MultiIndex.from_tuples(product(range(6), range(6)))
df = pd.Series(data=None, index=idx)
# a single .loc assignment avoids chained indexing
df.loc[(0, 0)] = 0
print(df)
A different approach (yielding an actual 2D matrix):
import pandas as pd  # you may need to install this package `pandas`
df = pd.DataFrame(index=range(6), columns=range(6))
# set a single cell (row 0, column 0) without chained indexing
df.loc[0, 0] = 0
print(df)
It seems you want to store the characters as data, which can be done. Although it seems like a bad idea, you could use the following alternative:
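import pandas as pd  # you may need to install this package `pandas`
s1 = 'CATTAG'
s2 = 'GGTCAC'
# one extra row and column for the labels: s1 runs along the top, s2 down the side
df = pd.DataFrame(index=range(len(s2) + 1), columns=range(len(s1) + 1))
df.loc[0] = list('x' + s1)  # first row: corner marker followed by s1
df[0] = list('x' + s2)      # first column: corner marker followed by s2
df.loc[1, 1] = 0            # set a single cell (row 1, column 1)
print(df)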
Output:
   0    1    2    3    4    5    6
0  x    C    A    T    T    A    G
1  G    0  NaN  NaN  NaN  NaN  NaN
2  G  NaN  NaN  NaN  NaN  NaN  NaN
3  T  NaN  NaN  NaN  NaN  NaN  NaN
4  C  NaN  NaN  NaN  NaN  NaN  NaN
5  A  NaN  NaN  NaN  NaN  NaN  NaN
6  C  NaN  NaN  NaN  NaN  NaN  NaN
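For comparison, the same layout can be sketched with plain nested lists and no third-party packages. This is only an illustration of the idea, not part of the pandas approach above: empty cells are represented by `None` (an assumption made for this sketch), and they are printed as dots.

```python
s1 = "CATTAG"  # labels along the top
s2 = "GGTCAC"  # labels down the side

# Build each row with its own list so that rows do not share one list object.
matrix = [[None] * (len(s1) + 1) for _ in range(len(s2) + 1)]
matrix[0] = ["x"] + list(s1)           # header row
for i, ch in enumerate(s2, start=1):   # header column
    matrix[i][0] = ch

matrix[1][1] = 0  # change a single cell

for r in matrix:
    print(" ".join("." if c is None else str(c) for c in r))
```

Note that the shortcut `[[None] * n] * m` would repeat the same row object m times, so changing one cell would appear to change a whole column. That is why the rows are built in a comprehension.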
If you want an idea of the performance with larger dataframes:
import timeit
import random
import pandas as pd  # you may need to install this package `pandas`

def define(size):
    global df
    df = pd.DataFrame(index=range(size), columns=pd.Series(range(size)))

def updates(x, size):
    global df
    for _ in range(x):
        df[random.randint(0, size-1)][random.randint(0, size-1)] = random.randint(0, 100)

print('create:', timeit.timeit(lambda: define(10000), number=2))
print('update:', timeit.timeit(lambda: updates(100000, 10000), number=2))
(each test run twice, 10,000 x 10,000 elements, 100,000 updates)
Results (in seconds):
create: 15.3850586
update: 20.7603317
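Not lightning fast, but it doesn't take ages either. You can get better performance with numpy, but you may lose some of the convenient features. The original question didn't suggest data of this size. With numpy: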
import timeit
import random
import numpy as np  # you may need to install this package `numpy`

def define(size):
    global arr
    arr = np.zeros(shape=(size, size))

def updates(x, size):
    global arr
    for _ in range(x):
        arr[random.randint(0, size-1)][random.randint(0, size-1)] = random.randint(0, 100)

print('create:', timeit.timeit(lambda: define(10000), number=2))
print('update:', timeit.timeit(lambda: updates(100000, 10000), number=2))
An order of magnitude faster:
create: 0.019482200000000005
update: 1.5629257
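One way to keep the character labels while using numpy is to leave the sequences as ordinary strings and store only the numeric cells in the array. The sketch below assumes numpy is installed. The name `scores` and the value 5 are arbitrary choices made for illustration.

```python
import numpy as np

s1 = "CATTAG"  # column labels, kept outside the array
s2 = "GGTCAC"  # row labels, kept outside the array

# One integer cell per (character of s2, character of s1) pair.
scores = np.zeros((len(s2), len(s1)), dtype=int)

# A single tuple index addresses one cell directly.
scores[0, 0] = 5

# The labels for any cell come from the strings themselves.
i, j = 2, 3
print(s2[i], s1[j], scores[i, j])
```

Since the labels are looked up by position, repeated characters cause no ambiguity here.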