Is there a way in R to represent the alternation of these sequences, SEQ1 and SEQ2, within a single ultra-long read?

Question

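I have a data frame in R similar to the one below, except that it is 2000 rows long. Throughout the data frame, SEQ1 and SEQ2 alternate within a single read called "id read". SEQ1 always starts 1 nucleotide after the preceding SEQ2, whereas SEQ2 lies about 335 nucleotides from the preceding SEQ1, sometimes jumping to about 670 nucleotides. The sequences occur in both forward and reverse orientation, which shows in the end coordinate: it is sometimes smaller than the start coordinate.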

sequence	id read	start	end	sequencedistance	sequencelength
SEQ1	id read	90	105	1	15
SEQ2	id read	440	458	335	18
SEQ1	id read	459	474	1	15
SEQ2	id read	808	826	334	18
SEQ1	id read	827	812	1	15
SEQ2	id read	1148	1156	336	18
SEQ1	id read	1157	1172	1	15
SEQ2	id read	1850	1868	678	18
SEQ1	id read	1869	1854	1	15
SEQ2	id read	2187	2205	333	18
SEQ1	id read	2206	2221	1	15
SEQ2	id read	2887	2905	666	18

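Does anyone have an idea how to plot this data and show visually the pattern these sequences follow within the read? I have tried plotting with horizontal lines, lollipops, and points, but none of these methods represents the amount of data I have effectively or gives an intuitive picture of how the sequences behave. Does anyone know how to plot the pattern? If necessary, I can plot only a portion of the large data frame, but I would at least like to understand the particularities of these sequences within the ultra-long read.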

Answer 1

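I'm still not sure what you are looking for, but if every row i with sequence == "SEQ1" has a paired row i + 1 with sequence == "SEQ2", you can calculate relative start and end sites and then try to visualise those.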

Assuming your data is in a variable called df, you can calculate these as follows.

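# Express each row's coordinates relative to a pair-specific origin:
# a SEQ1 row is measured from its own start, a SEQ2 row from the
# start of the row before it (its paired SEQ1).
df <- transform(
  df,
  rel_start = ifelse(
    as.character(sequence) == "SEQ1",
    start - start,                  # SEQ1 starts at 0 by definition
    start - c(0, head(start, -1))   # SEQ2: offset from the previous row's start
  ),
  rel_end = ifelse(
    as.character(sequence) == "SEQ1",
    end - start,                    # negative when SEQ1 is in reverse orientation
    end - c(0, head(start, -1))     # SEQ2: offset from the previous row's start
  )
)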
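The pairing assumption stated above can be checked before computing the relative coordinates. The following is a minimal base R sketch, using the first six rows of the question's table as a stand-in for the full df; the names is_alternating and pair are illustrative and not part of the original answer.

```r
# Stand-in for the full data frame: first six rows of the question's table
df <- data.frame(
  sequence = rep(c("SEQ1", "SEQ2"), 3),
  start    = c(90, 440, 459, 808, 827, 1148),
  end      = c(105, 458, 474, 826, 812, 1156)
)

# TRUE only if odd rows are SEQ1 and even rows are SEQ2 (strict alternation)
expected <- rep(c("SEQ1", "SEQ2"), length.out = nrow(df))
is_alternating <- all(as.character(df$sequence) == expected)

# Hypothetical helper column: index of the SEQ1/SEQ2 pair each row belongs to
df$pair <- cumsum(as.character(df$sequence) == "SEQ1")

is_alternating
df
```

If is_alternating comes back FALSE on the real data, the lagged subtraction in the transform() call would measure some SEQ2 rows from the wrong row, so it may be worth locating the offending rows first.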

Then, for the visualisation, you can use geom_segment(). Arrows can indicate the direction of each sequence.

library(ggplot2)

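# One row of the data frame per horizontal segment; the arrowhead sits at
# rel_end, so reverse-oriented sequences point left.
ggplot(df, aes(rel_start, y = seq_along(start), colour = sequence)) +
  geom_segment(aes(xend = rel_end, yend = seq_along(start)),
               arrow = arrow(length = unit(2, "mm")))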
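As a possible variation on the plot above, the orientation can be made explicit and the segments drawn at their absolute positions in the read, with one lane per sequence type. This is only a sketch on a six-row stand-in for df: the strand column is an illustrative name, and the rule end < start meaning reverse is taken from the question's description. On the full 2000-row data it would probably be readable only for a subset of the read.

```r
library(ggplot2)

# Stand-in for the full data frame: first six rows of the question's table
df <- data.frame(
  sequence = rep(c("SEQ1", "SEQ2"), 3),
  start    = c(90, 440, 459, 808, 827, 1148),
  end      = c(105, 458, 474, 826, 812, 1156)
)

# Infer orientation from the coordinates: end < start is treated as reverse
df$strand <- ifelse(df$end < df$start, "reverse", "forward")

# Absolute read coordinates on x, one lane per sequence type on y,
# colour showing the inferred orientation
p <- ggplot(df, aes(x = start, y = sequence, colour = strand)) +
  geom_segment(aes(xend = end, yend = sequence),
               arrow = arrow(length = unit(2, "mm"))) +
  labs(x = "position in read (nt)", y = NULL)

print(p)
```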

Data loading:

txt <- "sequence    id read     start   end     sequencedistance    sequencelength
SEQ1    id read     90  105     1   15
SEQ2    id read     440     458     335     18
SEQ1    id read     459     474     1   15
SEQ2    id read     808     826     334     18
SEQ1    id read     827     812     1   15
SEQ2    id read     1148    1156    336     18
SEQ1    id read     1157    1172    1   15
SEQ2    id read     1850    1868    678     18
SEQ1    id read     1869    1854    1   15
SEQ2    id read     2187    2205    333     18
SEQ1    id read     2206    2221    1   15
SEQ2    id read     2887    2905    666     18"

df <- read.table(text = txt, header = TRUE)
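One detail of this loading step is worth knowing: because read.table() splits on any whitespace by default, the "id read" header and values end up in two columns, id and read. The sketch below illustrates this on two rows. It also shows a tab-separated variant that keeps "id read" as a single column; that variant assumes the original file really is tab-delimited, which the question does not state.

```r
# Whitespace-separated, as in the data-loading snippet: "id read" is split in two
txt_ws <- "sequence id read start end
SEQ1 id read 90 105
SEQ2 id read 440 458"
df_ws <- read.table(text = txt_ws, header = TRUE)
names(df_ws)

# Tab-separated (an assumption about the source file): "id read" stays one
# column, and its header is converted to the syntactic name id.read
txt_tab <- "sequence\tid read\tstart\tend\nSEQ1\tid read\t90\t105\nSEQ2\tid read\t440\t458"
df_tab <- read.table(text = txt_tab, header = TRUE, sep = "\t")
names(df_tab)
```

Neither choice affects the plotting code, which only uses sequence, start, and end.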

Tags: plot, r, ggplot2, dataframe