Plotting occurrence frequency as letter height in a stacked bar chart

I have the following data in a list of dictionaries (positions as the list index, letters as the keys):

Position 0
L 458 6.81
K 238 3.54
A 676 10.06
T 738 10.98
G 390 5.8
N 190 2.83
! 8 0.12
S 798 11.87
D 137 2.04
M 76 1.13
R 222 3.3
F 168 2.5
Q 297 4.42
I 333 4.95
P 916 13.63
H 102 1.52
C 46 0.68
E 184 2.74
V 619 9.21
W 25 0.37
Y 101 1.5


Position 1
G 419 6.23
S 822 12.23
P 1019 15.16
A 719 10.7
N 239 3.56
F 154 2.29
! 6 0.09
M 75 1.12
T 637 9.48
V 524 7.8
Q 359 5.34
R 207 3.08
L 449 6.68
C 36 0.54
E 191 2.84
Y 90 1.34
K 268 3.99
I 246 3.66
H 101 1.5
D 145 2.16
W 16 0.24


Position 2
K 285 4.24
L 358 5.33
S 906 13.48
E 165 2.45
R 257 3.82
M 63 0.94
G 395 5.88
A 657 9.77
V 788 11.72
T 896 13.33
W 27 0.4
C 48 0.71
H 106 1.58
Q 251 3.73
F 204 3.03
P 578 8.6
D 135 2.01
I 288 4.28
Y 128 1.9
N 187 2.78


Position 3
S 3869 57.56
T 2845 42.32
I 1 0.01
K 1 0.01
A 1 0.01
V 3 0.04
G 2 0.03


Position 4
L 479 7.13
E 297 4.42
F 177 2.63
V 479 7.13
D 153 2.28
K 280 4.17
S 1107 16.47
P 488 7.26
A 629 9.36
T 731 10.87
W 40 0.6
R 224 3.33
I 239 3.56
Y 131 1.95
Q 409 6.08
N 189 2.81
G 442 6.58
M 83 1.23
C 51 0.76
H 89 1.32
! 5 0.07


Position 5
T 632 9.4
R 154 2.29
S 1067 15.87
Q 310 4.61
L 400 5.95
N 180 2.68
E 262 3.9
A 935 13.91
P 725 10.79
G 531 7.9
Y 115 1.71
V 433 6.44
W 27 0.4
H 108 1.61
K 178 2.65
C 43 0.64
D 174 2.59
M 72 1.07
F 163 2.42
I 191 2.84
! 22 0.33


Position 6
E 290 4.31
A 606 9.02
S 1093 16.26
F 189 2.81
R 202 3.01
I 197 2.93
G 511 7.6
T 658 9.79
K 237 3.53
H 103 1.53
L 412 6.13
P 615 9.15
M 75 1.12
! 37 0.55
Q 369 5.49
V 452 6.72
C 36 0.54
D 198 2.95
N 283 4.21
W 35 0.52
Y 124 1.84

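I want:

The figure above shows the frequency of each letter at a given position (field 3 in the data above). The height of each letter encodes its frequency.

I know how to make something like this:

But I don't know of any reasonably simple way to produce the same thing with letters of variable height.

Thanks for your help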

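The type of plot you are looking for is called a sequence logo. There are online applications that create such sequence logos, but they start from a multiple alignment or a sequence file. You have frequency data, so we need to build the logo ourselves.

First install the necessary packages: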

git clone https://github.com/saketkc/pyseqlogo.git
pip install biopython==1.77 # biopython 1.78 no longer has Bio.Alphabet

You say your data structure is a list of dictionaries, but what you posted in your question is plain text, so I first have to do some preprocessing.

>>> data = """Position 0
L 458 6.81
K 238 3.54
A 676 10.06
T 738 10.98
G 390 5.8
N 190 2.83
! 8 0.12
S 798 11.87
D 137 2.04
M 76 1.13
R 222 3.3
F 168 2.5
Q 297 4.42
I 333 4.95
P 916 13.63
H 102 1.52
C 46 0.68
E 184 2.74
V 619 9.21
W 25 0.37
Y 101 1.5


Position 1
G 419 6.23
S 822 12.23
P 1019 15.16
A 719 10.7
N 239 3.56
F 154 2.29
! 6 0.09
M 75 1.12
T 637 9.48
V 524 7.8
Q 359 5.34
R 207 3.08
L 449 6.68
C 36 0.54
E 191 2.84
Y 90 1.34
K 268 3.99
I 246 3.66
H 101 1.5
D 145 2.16
W 16 0.24


Position 2
K 285 4.24
L 358 5.33
S 906 13.48
E 165 2.45
R 257 3.82
M 63 0.94
G 395 5.88
A 657 9.77
V 788 11.72
T 896 13.33
W 27 0.4
C 48 0.71
H 106 1.58
Q 251 3.73
F 204 3.03
P 578 8.6
D 135 2.01
I 288 4.28
Y 128 1.9
N 187 2.78


Position 3
S 3869 57.56
T 2845 42.32
I 1 0.01
K 1 0.01
A 1 0.01
V 3 0.04
G 2 0.03


Position 4
L 479 7.13
E 297 4.42
F 177 2.63
V 479 7.13
D 153 2.28
K 280 4.17
S 1107 16.47
P 488 7.26
A 629 9.36
T 731 10.87
W 40 0.6
R 224 3.33
I 239 3.56
Y 131 1.95
Q 409 6.08
N 189 2.81
G 442 6.58
M 83 1.23
C 51 0.76
H 89 1.32
! 5 0.07


Position 5
T 632 9.4
R 154 2.29
S 1067 15.87
Q 310 4.61
L 400 5.95
N 180 2.68
E 262 3.9
A 935 13.91
P 725 10.79
G 531 7.9
Y 115 1.71
V 433 6.44
W 27 0.4
H 108 1.61
K 178 2.65
C 43 0.64
D 174 2.59
M 72 1.07
F 163 2.42
I 191 2.84
! 22 0.33


Position 6
E 290 4.31
A 606 9.02
S 1093 16.26
F 189 2.81
R 202 3.01
I 197 2.93
G 511 7.6
T 658 9.79
K 237 3.53
H 103 1.53
L 412 6.13
P 615 9.15
M 75 1.12
! 37 0.55
Q 369 5.49
V 452 6.72
C 36 0.54
D 198 2.95
N 283 4.21
W 35 0.52
Y 124 1.84"""

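Not all positions contain the same set of letters, so add the missing letters with a frequency of 0. Also divide the percentages by 100 to get numbers between 0 and 1.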

>>> keys = ['L', 'K', 'A', 'T', 'G', 'N', '!', 'S', 'D', 'M', 'R', 'F', 'Q', 'I', 'P', 'H', 'C', 'E', 'V', 'W', 'Y']

>>> all_scores = []
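>>> for position in data.split('\n\n\n'):  # positions are separated by two blank lines
      lines = position.splitlines()[1:]  # drop the "Position N" header line
      # keep the letter (first field) and the percentage (last field) as a fraction
      scores = [(line.split()[0], float(line.split()[-1]) / 100) for line in lines]
      if len(scores) != len(keys):
        # pad letters that do not occur at this position with a frequency of 0
        scores += [(key, 0.0) for key in keys if key not in [s[0] for s in scores]]
      all_scores.append(scores)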

>>> from pprint import pprint
>>> pprint(all_scores)
[[('L', 0.0681),
  ('K', 0.0354),
  ('A', 0.10060000000000001),
  ('T', 0.10980000000000001),
  ('G', 0.057999999999999996),
  ('N', 0.028300000000000002),
  ('!', 0.0012),
  ('S', 0.11869999999999999),
  ('D', 0.0204),
  ('M', 0.0113),
  ('R', 0.033),
  ('F', 0.025),
  ('Q', 0.044199999999999996),
  ('I', 0.0495),
  ('P', 0.1363),
  ('H', 0.0152),
  ('C', 0.0068000000000000005),
  ('E', 0.0274),
  ('V', 0.09210000000000002),
  ('W', 0.0037),
  ('Y', 0.015)],
 [('G', 0.0623),
  ('S', 0.1223),
  ('P', 0.1516),
  ('A', 0.107),
  ('N', 0.0356),
  ('F', 0.0229),
  ('!', 0.0009),
  ('M', 0.011200000000000002),
  ('T', 0.09480000000000001),
  ('V', 0.078),
  ('Q', 0.053399999999999996),
  ('R', 0.0308),
  ('L', 0.0668),
  ('C', 0.0054),
  ('E', 0.028399999999999998),
  ('Y', 0.0134),
  ('K', 0.039900000000000005),
  ('I', 0.0366),
  ('H', 0.015),
  ('D', 0.0216),
  ('W', 0.0024)],
 [('K', 0.0424),
  ('L', 0.0533),
  ('S', 0.1348),
  ('E', 0.0245),
  ('R', 0.0382),
  ('M', 0.009399999999999999),
  ('G', 0.0588),
  ('A', 0.0977),
  ('V', 0.11720000000000001),
  ('T', 0.1333),
  ('W', 0.004),
  ('C', 0.0070999999999999995),
  ('H', 0.0158),
  ('Q', 0.0373),
  ('F', 0.030299999999999997),
  ('P', 0.086),
  ('D', 0.020099999999999996),
  ('I', 0.042800000000000005),
  ('Y', 0.019),
  ('N', 0.0278),
  ('!', 0.0)],
 [('S', 0.5756),
  ('T', 0.4232),
  ('I', 0.0001),
  ('K', 0.0001),
  ('A', 0.0001),
  ('V', 0.0004),
  ('G', 0.0003),
  ('L', 0.0),
  ('N', 0.0),
  ('!', 0.0),
  ('D', 0.0),
  ('M', 0.0),
  ('R', 0.0),
  ('F', 0.0),
  ('Q', 0.0),
  ('P', 0.0),
  ('H', 0.0),
  ('C', 0.0),
  ('E', 0.0),
  ('W', 0.0),
  ('Y', 0.0)],
 [('L', 0.0713),
  ('E', 0.044199999999999996),
  ('F', 0.0263),
  ('V', 0.0713),
  ('D', 0.022799999999999997),
  ('K', 0.0417),
  ('S', 0.16469999999999999),
  ('P', 0.0726),
  ('A', 0.09359999999999999),
  ('T', 0.10869999999999999),
  ('W', 0.006),
  ('R', 0.0333),
  ('I', 0.0356),
  ('Y', 0.0195),
  ('Q', 0.0608),
  ('N', 0.0281),
  ('G', 0.0658),
  ('M', 0.0123),
  ('C', 0.0076),
  ('H', 0.0132),
  ('!', 0.0007000000000000001)],
 [('T', 0.094),
  ('R', 0.0229),
  ('S', 0.15869999999999998),
  ('Q', 0.0461),
  ('L', 0.059500000000000004),
  ('N', 0.0268),
  ('E', 0.039),
  ('A', 0.1391),
  ('P', 0.1079),
  ('G', 0.079),
  ('Y', 0.0171),
  ('V', 0.0644),
  ('W', 0.004),
  ('H', 0.0161),
  ('K', 0.0265),
  ('C', 0.0064),
  ('D', 0.0259),
  ('M', 0.010700000000000001),
  ('F', 0.0242),
  ('I', 0.028399999999999998),
  ('!', 0.0033)],
 [('E', 0.0431),
  ('A', 0.0902),
  ('S', 0.16260000000000002),
  ('F', 0.0281),
  ('R', 0.0301),
  ('I', 0.029300000000000003),
  ('G', 0.076),
  ('T', 0.09789999999999999),
  ('K', 0.0353),
  ('H', 0.015300000000000001),
  ('L', 0.0613),
  ('P', 0.0915),
  ('M', 0.011200000000000002),
  ('!', 0.0055000000000000005),
  ('Q', 0.054900000000000004),
  ('V', 0.0672),
  ('C', 0.0054),
  ('D', 0.029500000000000002),
  ('N', 0.0421),
  ('W', 0.0052),
  ('Y', 0.0184)]]
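
If the data is already a list of dictionaries, as described in the question, the text parsing can be skipped. Here is a minimal sketch, assuming each dictionary maps a letter to its percentage (field 3). The function name to_scores and the sample values are only illustrative:

```python
def to_scores(position_dicts, alphabet):
    """Convert [{letter: percent}, ...] into [[(letter, fraction), ...], ...].

    Letters missing at a position are padded with 0.0 so that every
    position lists the full alphabet.
    """
    all_scores = []
    for freqs in position_dicts:
        # percentages (0-100) become fractions (0-1)
        scores = [(letter, percent / 100) for letter, percent in freqs.items()]
        # append the letters that do not occur at this position
        scores += [(letter, 0.0) for letter in alphabet if letter not in freqs]
        all_scores.append(scores)
    return all_scores


# Illustrative input: a shortened version of position 3 from the question.
example = [{'S': 57.56, 'T': 42.32, 'V': 0.04}]
print(to_scores(example, alphabet=['S', 'T', 'V', 'G']))
```

The result has the same shape as all_scores above, so it could be passed to the plotting step in the same way.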

! is not present in any of the default colorschemes, so we add it to a copy of the hydrophobicity colorscheme.

>>> colorscheme = {
    'R': 'blue',
    'K': 'blue',
    'D': 'blue',
    'E': 'blue',
    'N': 'blue',
    'Q': 'blue',
    'S': 'darkgreen',
    'G': 'darkgreen',
    'H': 'darkgreen',
    'T': 'darkgreen',
    'A': 'darkgreen',
    'P': 'darkgreen',
    'Y': 'black',
    'V': 'black',
    'M': 'black',
    'C': 'black',
    'L': 'black',
    'F': 'black',
    'I': 'black',
    'W': 'black',
    '!': 'black'
}
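
Before plotting, it may be worth confirming that the mapping covers every symbol in the data, since how a plotting library reacts to a missing colour is not something to rely on. Here is a small sketch; missing_colors is a hypothetical helper, not part of pyseqlogo:

```python
def missing_colors(symbols, colorscheme):
    """Return the symbols that have no entry in the colorscheme mapping."""
    return sorted(set(symbols) - set(colorscheme))


# Illustrative check with a deliberately incomplete scheme.
partial_scheme = {'S': 'darkgreen', 'T': 'darkgreen'}
print(missing_colors(['S', 'T', '!'], partial_scheme))
```

Calling missing_colors(keys, colorscheme) with the keys list and the colorscheme defined above should return an empty list.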

Now draw the sequence logo:

>>> from pyseqlogo.pyseqlogo import draw_logo
>>> import matplotlib.pyplot as plt
>>> plt.rcParams['figure.dpi'] = 300
>>> fig, axarr = draw_logo(all_scores, colorscheme=colorscheme) 
>>> fig.tight_layout()
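
For reference, the stacking behind such a figure is simple: at each position the letters are sorted by frequency and placed on top of each other, each with a height equal to its frequency. The sketch below computes only that layout with the standard library. It is not how pyseqlogo is implemented, and stack_layout is a hypothetical name. The resulting (letter, bottom, height) triples could then be handed to any drawing routine that can scale glyphs.

```python
def stack_layout(all_scores):
    """For each position, return a list of (letter, bottom, height) triples.

    Letters are stacked from the least to the most frequent, so the
    dominant letter ends up on top; zero-frequency letters are dropped.
    """
    layout = []
    for scores in all_scores:
        bottom = 0.0
        column = []
        for letter, freq in sorted(scores, key=lambda s: s[1]):
            if freq <= 0:
                continue  # nothing to draw for absent letters
            column.append((letter, bottom, freq))
            bottom += freq  # the next letter starts where this one ends
        layout.append(column)
    return layout


# Illustrative input: a shortened version of position 3 from the question.
sample = [[('S', 0.5756), ('T', 0.4232), ('V', 0.0004), ('G', 0.0)]]
for letter, bottom, height in stack_layout(sample)[0]:
    print(letter, bottom, height)
```

Because the heights are plain frequencies, each column adds up to roughly 1, which is what makes the result look like a stacked bar chart built from letters.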