R ggplot2 logarithmic cut with negative and positive values on x-axis and mean per bin of y-axis

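I am looking for a way to plot the mean of one variable across bins of the log2 of another variable (which has both positive and negative values), using the more advanced functions in ggplot2. I think I am mostly overcomplicating this, and it is probably already built into ggplot2's finer options, but I cannot get it right. So before going back to basics, I thought I would try to learn here how to apply these functions.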

library(ggplot2)

value <- rnorm(1000, 0, 20)
dist <- c(rep(0, 15), sample(1:490), sample(-1:-495))
data <- data.frame(value = value, dist = dist)

data$log <- log2(abs(data$dist) + 1)
# Relabel the x-axis: undo the log2 transform
data$Labels <- 2^(abs(data$log)) - 1

data$bins=cut(data$log, breaks=10)
# Try to recover the negative log after transformation
data$sign=ifelse(data$dist==0, 0, ifelse(data$dist>0, "+", "-"))

# Find the average of value (and the number of observations) in each bin
data = with(data, aggregate(data$value, by = list(bins, sign),
                            FUN = function(x) c(mn = mean(x), n = length(x))))
data= as.data.frame(as.list(data))
names(data)=c("bins", "sign", "mean", "length")

# I am doing this in a very contorted way to try to achieve what I would like which is something like this:

bin_num = do.call("rbind", lapply(strsplit(sapply(as.character(data$bins), function(x) substr(x, 2, nchar(x)-1)), ","), as.numeric))
data$bin_num=bin_num[,1]
data$bin_num=ifelse(data$sign==0, 0, ifelse(data$sign=="-", -data$bin_num, data$bin_num))
data = data[order(data$bin_num),]

data <- transform(data, x2 = factor(paste(sign, bins)))
data <- transform(data, x2 = reorder(x2, rank(bin_num)))

# Line plot to show the distribution of the means across the bins of log2 of x:
ggplot(data, aes(y = mean, x = bin_num, group=1)) +  geom_point() + geom_line()

# I then tried to relabel the log-transformed axis by adding labels here, but of course it doesn't work:

ggplot(data, aes(y = mean, x = bin_num, group=1)) +  geom_point() + geom_line() + scale_x_discrete(labels=data$dist, breaks=data$bin_num)

I see that ggplot2 can compute the mean directly, so I may not even need the preceding commands. I tried:

ggplot(data, aes(x = bins, y = mean)) + stat_summary(fun.y = "mean") + geom_line() + scale_x_continuous(breaks = labels)

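But of course it doesn't work... I have also seen that ggplot2 has features that handle log labelling automatically, instead of what I am doing here, but I don't know how to use them when there are negative values to be logged. There is a very nice function in another question here that transforms both kinds of values, but I don't think it is useful at this stage. Thank you very much for any suggestions on how to solve this... really appreciated!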
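
For reference, one common way to take a log2 of values on both sides of zero is to transform the magnitude and put the sign back afterwards. The base R sketch below is only an illustration of that idea. The helper names `signed_log2` and `signed_log2_inv` are hypothetical (they come from neither the question nor ggplot2), and the `+ 1` offset simply mirrors the `log2(abs(dist) + 1)` used above.

```r
# Sign-preserving log2 transform: keeps the sign of x, compresses its magnitude.
# Hypothetical helper names, chosen for this illustration only.
signed_log2 <- function(x) sign(x) * log2(abs(x) + 1)

# Inverse transform: maps a transformed value back to the original scale.
signed_log2_inv <- function(y) sign(y) * (2^abs(y) - 1)

dist <- c(-495, -7, -1, 0, 1, 7, 490)
slog <- signed_log2(dist)
print(slog)

# Round trip back to the original distances
print(signed_log2_inv(slog))
```

Because the transform is monotone and invertible, bins computed on the transformed scale can be mapped back to distances for labelling.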

First version of the answer, using data.table for speed and better readability:

Code that reproduces the problem in a shorter and faster way:

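library(data.table)
library(ggplot2)

# Return the lower bound of the interval that cut() assigns to each element of x
lower.bound <- function(x, n) {
  bins <- as.character(cut(x, n))
  # Interval labels look like "(a,b]": keep the text between "(" and ","
  tmp <- substr(x = bins, start = 2, stop = regexpr(",", bins) - 1)
  return(as.numeric(tmp))
}

nbin <- 10
set.seed(123)
dat <- data.table(value = rnorm(1000, 0, 20),
                  dist = c(rep(0, 15), sample(1:490), sample(-1:-495)))

dat[, log := log2(abs(dist) + 1)]
# Undo the log2(|dist| + 1) transform (not used in the plot below)
dat[, labels := 2^log - 1]
dat[, sign := ifelse(dist == 0, 
                     0,
                     ifelse(dist > 0, "+", "-"))]

# Signed lower bound of the log2 bin: negative for negative distances
dat[, bin := ifelse(sign == 0, 
                    0,
                    ifelse(sign == "+", 
                           lower.bound(log, nbin),
                           -lower.bound(log, nbin)))]

# Mean value, bin size and mean distance for each (bin, sign) group
sumdat <- dat[, .(mvalue = mean(value),
                  nvalue = .N,
                  ylab = mean(dist)), 
                 by = .(bin, sign)][order(bin)]

# ylab (the mean distance in each bin) is used as the x position
ggplot(sumdat, aes(x = ylab, y = mvalue)) + geom_line()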
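
The plot above places each bin at its mean distance, so the x-axis is linear. If the goal is instead to keep the bins evenly spaced on the log2 scale while labelling them with the original distances, something along the following lines may work. It is a sketch under stated assumptions rather than a definitive solution: the helpers `signed_log2` and `signed_log2_inv` are hypothetical names, and using integer break points on the transformed scale is an arbitrary choice. `cut()` and `aggregate()` are base R; `scale_x_continuous()` with `breaks` and `labels` is standard ggplot2.

```r
set.seed(123)
dat <- data.frame(value = rnorm(1000, 0, 20),
                  dist  = c(rep(0, 15), sample(1:490), sample(-1:-495)))

# Sign-preserving log2 transform and its inverse (hypothetical helper names)
signed_log2     <- function(x) sign(x) * log2(abs(x) + 1)
signed_log2_inv <- function(y) sign(y) * (2^abs(y) - 1)

dat$slog <- signed_log2(dat$dist)

# Integer break points on the signed log2 scale, covering the data range
# (assumption: one bin per unit of log2)
brks <- seq(floor(min(dat$slog)), ceiling(max(dat$slog)), by = 1)
dat$bin <- cut(dat$slog, breaks = brks, include.lowest = TRUE)

# Mean of value and number of observations, one row per non-empty bin
sumdat <- aggregate(value ~ bin, data = dat, FUN = mean)
sumdat$n <- as.vector(table(dat$bin)[as.character(sumdat$bin)])

# Bin midpoint on the transformed scale, used as the x position
mids <- (head(brks, -1) + tail(brks, -1)) / 2
sumdat$x <- mids[as.integer(sumdat$bin)]

# The plot is only drawn when ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(sumdat, aes(x = x, y = value)) +
    geom_point() +
    geom_line() +
    # Ticks sit on the log2 scale; labels show the original distances
    scale_x_continuous(breaks = brks,
                       labels = as.character(signed_log2_inv(brks))) +
    labs(x = "dist", y = "mean(value)")
  print(p)
}
```

Keeping the aggregation in base R avoids parsing the interval labels produced by `cut()`, since the break points are chosen explicitly and can be passed to the inverse transform for labelling.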