How to deal with ggplot2 and overlapping labels on a discrete axis

Question

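ggplot2 does not seem to have a built-in way of dealing with overplotting of text on scatter plots. However, I have a different situation where the labels are those on a discrete axis, and I'm wondering if someone here has a better solution than what I have been doing.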

Some example code:

library(ggplot2)

#some example data
test.data = data.frame(text = c("A full commitment's what I'm thinking of",
                                "History quickly crashing through your veins",
                                "And I take A deep breath and I get real high",
                                "And again, the Internet is not something that you just dump something on. It's not a big truck."),
                       mean = c(3.5, 3, 5, 4),
                       CI.lower = c(3, 2.5, 4.5, 3.5),
                       CI.upper = c(4, 3.5, 5.5, 4.5))

#plot
ggplot(test.data, aes(x = text, y = mean)) +
  geom_point(stat="identity") +
  geom_errorbar(aes(ymax = CI.upper, ymin = CI.lower), width = .1) +
  #no labels= needed: a plain vector would be matched to the sorted axis breaks by position
  scale_x_discrete(name = "")

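So we see that the x-axis labels overlap each other. Two solutions spring to mind: 1) abbreviating the labels, and 2) adding newlines to the labels. In many cases (1) will do, but in some cases it cannot be done. So I wrote a function that inserts a newline (\n) after every n-th character of a string to keep the names from overlapping: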

library(ggplot2)

#Inserts newlines into strings every N interval
new_lines_adder = function(test.string, interval){
  #length of str
  string.length = nchar(test.string)
  #split by N char intervals
  split.starts = seq(1,string.length,interval)
  split.ends = c(split.starts[-1]-1,nchar(test.string))
  #split it
  test.string = substring(test.string, split.starts, split.ends)
  #put it back together with newlines
  test.string = paste0(test.string,collapse = "\n")
  return(test.string)
}

#a user-level wrapper that also works on character vectors, data.frames, matrices and factors
add_newlines = function(x, interval) {
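  #class(x) == "matrix" has length 2 on R >= 4.0, so test with is.*() instead
  if (is.data.frame(x) || is.matrix(x) || is.factor(x)) {
    #flatten data.frame columns; as.character() also converts factors and matrices
    x = if (is.data.frame(x)) unlist(lapply(x, as.character)) else as.character(x)
  }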

  if (length(x) == 1) {
    return(new_lines_adder(x, interval))
  } else {
    t = sapply(x, FUN = new_lines_adder, interval = interval) #apply splitter to each
    names(t) = NULL #remove names
    return(t)
  }
}

#plot again
ggplot(test.data, aes(x = text, y = mean)) +
  geom_point(stat="identity") +
  geom_errorbar(aes(ymax = CI.upper, ymin = CI.lower), width = .1) +
  #pass a function so that each label is built from its own axis break
  scale_x_discrete(labels = function(x) add_newlines(x, 20), name = "")

The output:

One can then spend some time tuning the interval size to avoid too much white space between the labels.

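If the number of labels varies, this kind of solution is not so good, because the optimal interval size changes. Also, because the normal font is not mono-spaced, the text of the labels affects the width too, so one has to take extra care when choosing a good interval (one can avoid this by using a mono-spaced font, but those are extra wide). Finally, the new_lines_adder() function is stupid in that it splits words in two in silly ways that a human would not. E.g. in the above it splits "breath" into "br\neath". One could rewrite it to avoid this problem.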

One can also decrease the font size, but that is a trade-off against readability, and often decreasing the font size is unnecessary.

What is the best way of handling this kind of label overlap?

Answer 1

I tried to put together a different version of new_lines_adder:

new_lines_adder = function(test.string, interval) {
   #split at spaces
   string.split = strsplit(test.string," ")[[1]]
   # get length of snippets, add one for space
   lens <- nchar(string.split) + 1
   # now the trick: integer-divide each word's cumulative end position
   # by interval + 1 to assign it to a line of roughly interval characters
   lines <- cumsum(lens) %/% (interval + 1)
   # construct the lines
   test.lines <- tapply(string.split,lines,function(line)
      paste0(paste(line,collapse=" "),"\n"),simplify = TRUE)
   # put everything into a single string
   result <- paste(test.lines,collapse="")
   return(result)
}

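It splits lines only at spaces and keeps them to roughly interval characters. A line can run past interval when its first word starts just before a boundary, because each word is assigned to a line by the position where it ends. With this, your plot looks as follows: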

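I would not claim that this is the best way. It still ignores that not all characters have the same width. Maybe something better can be achieved using strwidth.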
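As a rough illustration of the strwidth idea, here is a minimal sketch that breaks at spaces but measures each candidate line in inches rather than in characters. The function name wrap_by_width and its max.width parameter are hypothetical, not part of the code above. It also assumes that the base graphics font measured by strwidth is close enough to the font ggplot2 uses for the axis text, which will not hold exactly, so treat max.width as a value to tune.

```r
# Hypothetical helper: greedy word wrap driven by rendered width (inches)
# instead of character counts. Assumes a graphics device is open.
wrap_by_width = function(label, max.width, cex = 1) {
  words = strsplit(label, " ", fixed = TRUE)[[1]]
  if (length(words) == 0) return("")
  lines = character(0)
  current = words[1]
  for (word in words[-1]) {
    candidate = paste(current, word)
    # keep extending the current line while it still fits
    if (strwidth(candidate, units = "inches", cex = cex) <= max.width) {
      current = candidate
    } else {
      lines = c(lines, current)
      current = word
    }
  }
  paste(c(lines, current), collapse = "\n")
}

pdf(NULL)    # off-screen device so that strwidth() has something to measure against
plot.new()
label = "And I take A deep breath and I get real high"
cat(wrap_by_width(label, max.width = 1.5), "\n")
invisible(dev.off())
```

A single word wider than max.width still ends up on a line of its own, so very long words can overlap regardless.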

By the way: you can simplify add_newlines to the following:

add_newlines = function(x, interval) {

   # make sure, x is a character array   
   x = as.character(x)
   # apply splitter to each
   t = sapply(x, FUN = new_lines_adder, interval = interval,USE.NAMES=FALSE)
   return(t)
}

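At the start, as.character makes sure that you have a character string. Doing this does no harm if you already have a string, so there is no need for the if clause.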

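The next if clause is also unnecessary: sapply works perfectly well if x contains only one element. And you can suppress the names by setting USE.NAMES=FALSE, so you do not need to remove the names in an additional line.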
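For comparison, splitting only at word boundaries can also be sketched with base R's strwrap(), which breaks at whitespace and returns lines shorter than the given width. This is an alternative to the cumsum approach above, not a reimplementation of it, and the name wrap_labels is made up for this example.

```r
# Hypothetical alternative: let strwrap() do the word-boundary splitting.
# Each returned line has fewer than `width` characters, unless a single
# word is longer than that.
wrap_labels = function(x, width) {
  vapply(
    as.character(x),
    function(s) paste(strwrap(s, width = width), collapse = "\n"),
    character(1),
    USE.NAMES = FALSE
  )
}

label = "And I take A deep breath and I get real high"
cat(wrap_labels(label, 20), "\n")
```

Like the versions above, this still counts characters, so it does not address unequal character widths.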

Answer 2

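Building on @Stibu's answer and comment, this solution takes the number of groups into account and uses the smart splitting developed by Stibu, while adding a fix for words separated by a slash.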

Functions:

library(stringr) #provides str_replace_all()

#Inserts newlines into strings every N interval
new_lines_adder = function(x, interval) {
  #add spaces after /
  x = str_replace_all(x, "/", "/ ")
  #split at spaces
  x.split = strsplit(x, " ")[[1]]
  # get length of snippets, add one for space
  lens <- nchar(x.split) + 1
  # now the trick: integer-divide each word's cumulative end position
  # by interval + 1 to assign it to a line of roughly interval characters
  lines <- cumsum(lens) %/% (interval + 1)
  # construct the lines
  x.lines <- tapply(x.split, lines, function(line)
    paste0(paste(line, collapse=" "), "\n"), simplify = TRUE)
  # put everything into a single string
  result <- paste(x.lines, collapse="")
  #remove spaces we added after /
  result = str_replace_all(result, "/ ", "/")
  return(result)
}

#wrapper for the above, meant for users
add_newlines = function(x, total.length = 85) {
  # make sure, x is a character array   
  x = as.character(x)
  #determine number of groups
  groups = length(x)
  # apply splitter to each
  t = sapply(x, FUN = new_lines_adder, interval = round(total.length/groups), USE.NAMES=FALSE)
  return(t)
}

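I tried a few values for the default input, and 85 is the value at which the text result fits the example data. Any higher and "veins" in label 2 moves up and gets too close to the third label.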
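To make the budget concrete, the following small sketch reproduces the arithmetic that add_newlines() performs before calling the splitter. The helper name chars_per_label is hypothetical and exists only for this illustration.

```r
# Hypothetical helper mirroring round(total.length/groups) in add_newlines():
# the total character budget is divided evenly among the labels.
chars_per_label = function(n.labels, total.length = 85) {
  round(total.length / n.labels)
}

chars_per_label(4) # 21: the interval used for the four example labels
```

With more labels on the axis the per-label interval shrinks accordingly, which is what lets a single default cover a varying number of groups.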

Here is how it looks:

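It would be better, though, to use a real measure of the total text width rather than the number of characters, since having to rely on this proxy generally means that the labels waste a lot of space. Perhaps one could rewrite new_lines_adder() with some code based on strwidth to handle the problem of unequal character widths.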

I am leaving this question unanswered in case someone can find a way to do that.

I have added these two functions to my personal package on GitHub, so anyone who wants to use them can get them from there.
