Altair/Vega-Lite heatmap: Filter top k

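I am trying to create a heatmap and retain only the top 5 samples based on the mean of the relative abundance. I am able to sort the heatmap properly, but I can't figure out how to retain only the top 5, in this case samples c, e, b, y, a. A subset of the df is pasted below. I have tried myriad permutations of the "Top K Items Tutorial" on the altair-viz website. If possible, I would prefer to do the filtering in altair rather than filtering the df itself in the python code.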

Dataframe:

,Lowest_Taxon,p1,p2,p3,p4,p5,p6,p7
0,a,0.03241281,0.0,0.467738067,3.14456785,0.589519651,13.5744323,0.0
1,b,0.680669,9.315121951,2.848404893,13.99058458,2.139737991,16.60779366,7.574639383
2,c,40.65862829,1.244878049,71.01223315,4.82197541,83.18777293,0.0,0.0
3,d,0.0,0.0,0.0,0.548471137,0.272925764,0.925147183,0.0
4,e,0.090755867,13.81853659,5.205085152,27.75721011,1.703056769,19.6691898,12.27775914
5,f,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,g,0.187994295,0.027317073,0.0,0.0,0.0,0.02242781,0.0
7,h,0.16854661,0.534634146,1.217318302,7.271813154,1.73580786,0.57751612,0.57027843
8,i,0.142616362,2.528780488,1.163348525,0.34279446,0.0,0.0,0.0
9,j,1.711396344,0.694634146,0.251858959,4.273504274,0.087336245,1.721334455,0.899027172
10,k,0.0,1.475121951,0.0,0.0,0.0,5.573310906,0.0
11,l,0.194476857,0.253658537,1.517150396,2.413273002,0.949781659,5.147182506,1.650452868
12,m,0.0,1.736585366,0.0,0.063988299,0.0,8.42724979,0.623951694
13,n,0.0,0.0,0.0,0.0,0.0,0.0,0.0
14,o,4.68689226,0.12097561,0.0,0.0,0.0,0.0,0.0
15,p,0.0,0.885853659,0.0,0.0,0.0,0.913933277,0.046964106
16,q,0.252819914,0.050731707,0.023986568,0.0,0.087336245,0.0,0.0
17,r,0.0,0.0,0.0,0.0,0.0,0.0,0.0
18,s,0.0,0.0,0.0,0.0,0.0,0.0,0.0
19,t,0.0,0.0,0.0,0.0,0.0,0.0,0.0
20,u,0.0,0.058536585,0.089949628,0.356506239,0.0,0.285954584,1.17410265
21,v,0.0,0.0,0.0,0.0,0.0,0.0,0.0
22,w,0.0,0.0,0.0,0.0,0.0,0.0,0.0
23,x,1.471541553,2.396097561,0.593667546,0.278806161,0.065502183,0.280347631,0.952700436
24,y,0.0,0.32,0.0,0.461629873,0.0,7.804878049,18.38980208
25,z,0.0,0.0,0.0,0.0,0.0,0.0,0.0

Code block:

import pandas as pd
import numpy as np
import altair as alt
from vega_datasets import data
from altair_saver import save

# Read in the file and fill empty cells with zero
# (placeholder path; raw string so that "\t" is not interpreted as a tab)
df = pd.read_excel(r"path\to\df").fillna(0)

doNotMelt = df.drop(df.iloc[:,1:], axis=1)
df_melted = pd.melt(df, id_vars = doNotMelt, var_name = 'SampleID', value_name = 'Relative_abundance')

# Tell altair to plot as many rows as is necessary
alt.data_transformers.disable_max_rows()

alt.Chart(df_melted).mark_rect().encode(
    alt.X('SampleID:N'),
    alt.Y('Lowest_Taxon:N', sort=alt.EncodingSortField(field='Relative_abundance', op='mean', order='descending')),
    alt.Color('Relative_abundance:Q')
)

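If you know that the entries you want to show are c, e, b, y and a (and that this is not going to change later), you can simply apply a transform_filter on the field Lowest_Taxon.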

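If you want to compute on the fly which entries make it into the top five, it takes a bit more effort: a combination of joinaggregate, window and filter transforms.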
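To make the idea behind that combination concrete, here is a minimal pandas sketch of what the three transforms compute. It is only an illustration under stated assumptions: the toy values and the name long_df are made up, every taxon is assumed to have the same number of samples, and no two taxa are assumed to share the same mean. Because every row of a taxon carries the same mean, all of its rows tie on rank, so the k-th best taxon gets rank (k-1) * count + 1.

```python
import pandas as pd

# Hypothetical toy data in long form: 3 taxa x 2 samples (not the question's data)
long_df = pd.DataFrame({
    "Lowest_Taxon": ["a", "a", "b", "b", "c", "c"],
    "SampleID": ["p1", "p2", "p1", "p2", "p1", "p2"],
    "Relative_abundance": [1.0, 3.0, 10.0, 20.0, 5.0, 7.0],
})
n = 2  # number of top taxa to keep

# joinaggregate: attach the per-taxon mean and sample count to every row
grouped = long_df.groupby("Lowest_Taxon")["Relative_abundance"]
long_df["mean_rel_ab"] = grouped.transform("mean")
long_df["count_of_samples"] = grouped.transform("count")

# window rank over all rows, descending by mean; tied rows share the lowest rank
long_df["rank"] = (
    long_df["mean_rel_ab"].rank(method="min", ascending=False).astype(int)
)

# filter: the n-th best taxon has rank (n - 1) * count + 1
top = long_df[long_df["rank"] <= (n - 1) * long_df["count_of_samples"] + 1]
top_taxa = sorted(top["Lowest_Taxon"].unique())
print(top_taxa)
```

With these toy values the ranks come out as 1 for b, 3 for c and 5 for a, so n = 2 keeps b and c.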

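I have pasted an example for both cases below. By the way, I converted the raw data you pasted into a csv file, which the code snippets import. You can make your pandas toy data easier for others to use by providing it as a dict, which can then be read directly in the code.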
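As a rough sketch of what providing the data as a dict could look like, the snippet below uses only the first three rows (a, b, c) and the first three samples (p1 to p3) of the table in the question. The variable name toy is just an illustrative choice.

```python
import pandas as pd

# Subset of the question's table: taxa a, b, c and samples p1 to p3 only
toy = {
    "Lowest_Taxon": ["a", "b", "c"],
    "p1": [0.03241281, 0.680669, 40.65862829],
    "p2": [0.0, 9.315121951, 1.244878049],
    "p3": [0.467738067, 2.848404893, 71.01223315],
}
df = pd.DataFrame(toy)

# Reshape from wide to long form, one row per (taxon, sample) pair
df_melted = df.melt(
    id_vars="Lowest_Taxon",
    var_name="SampleID",
    value_name="Relative_abundance",
)
print(df_melted)
```

Anyone can paste this into a session and get the same long-form frame without needing a separate file.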

Simple approach:

import pandas as pd
import altair as alt
import numpy as np
alt.data_transformers.disable_max_rows()

df = pd.read_csv('df.csv', index_col=0)

doNotMelt = df.drop(df.iloc[:,1:], axis=1)
df_melted = pd.melt(df, id_vars = doNotMelt, var_name = 'SampleID', value_name = 'Relative_abundance')


alt.Chart(df_melted).mark_rect().encode(
    alt.X('SampleID:N'),
    alt.Y('Lowest_Taxon:N', sort=alt.EncodingSortField(field='Relative_abundance', op='mean', order='descending')),
    alt.Color('Relative_abundance:Q')
).transform_filter(
    alt.FieldOneOfPredicate(field='Lowest_Taxon', oneOf=['c', 'e', 'b', 'y', 'a'])
)

Flexible approach:

  • Set n to the number of top entries you want to see

import pandas as pd
import altair as alt
import numpy as np
alt.data_transformers.disable_max_rows()

df = pd.read_csv('df.csv', index_col=0)

doNotMelt = df.drop(df.iloc[:,1:], axis=1)
df_melted = pd.melt(df, id_vars = doNotMelt, var_name = 'SampleID', value_name = 'Relative_abundance')

n = 5  # number of entries to display

alt.Chart(df_melted).mark_rect().encode(
    alt.X('SampleID:N'),
    alt.Y('Lowest_Taxon:N', sort=alt.EncodingSortField(field='Relative_abundance', op='mean', order='descending')),
    alt.Color('Relative_abundance:Q')
).transform_joinaggregate(
    mean_rel_ab = 'mean(Relative_abundance)',
    count_of_samples = 'valid(Relative_abundance)',
    groupby = ['Lowest_Taxon']
).transform_window(
    rank='rank(mean_rel_ab)',
    sort=[alt.SortField('mean_rel_ab', order='descending')],
    frame = [None, None]
).transform_filter(
    (alt.datum.rank <= (n-1) * alt.datum.count_of_samples + 1)