Why does my graph (using ggplot) vary by the use of as.factor() in R?

Question

I am trying to use a bar graph to look at the proportion of employees who quit, split by whether they were promoted.

Data: structure(list(promo = c(0, 0, 0, 0, 1, 1), left = c(0, 0, 0, 1, 0, 1)), .Names = c("promo", "left"), row.names = c(NA, -6L), class = "data.frame")

Case 1: I used y = as.factor(left)

 ggplot(data = HR, aes(x = as.factor(promotion), y =as.factor(left), fill = factor(promotion), colour=factor(promotion))) + 
      geom_bar(stat="identity")+
      xlab('Promotion (True or False)')+
      ylab('The Employees that quit')+
      ggtitle('Comparison of Employees that resigned')

This produced the graph below (Case 1).

Case 2: I used y = (left)

ggplot(data = HR, aes(x = as.factor(promotion), y = (left), fill = factor(promotion), colour=factor(promotion))) + 
      geom_bar(stat="identity")+
      xlab('Promotion (True or False)')+
      ylab('The Employees that quit')+
      ggtitle('Comparison of Employees that resigned')

This produced the graph below (Case 2).

What causes this difference? Which graph should I draw my inferences from?

Answer 1

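I'm guessing your data looks like this. In the future, it helps to share your data reproducibly so that it can be copy/pasted, like this. (dput() is useful for creating a copy/pasteable version of an R object definition.)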

df = data.frame(promo = c(rep(0, 4), rep(1, 2)),
                left = c(0, 0, 0, 1, 0, 1))
df
#   promo left
# 1     0    0
# 2     0    0
# 3     0    0
# 4     0    1
# 5     1    0
# 6     1    1
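As a minimal sketch of the dput() suggestion, the snippet below prints a copy/pasteable definition of the example data frame and checks that evaluating that text rebuilds the same object. The name `df_copy` is only illustrative and does not come from the question.

```r
# Example data, as assumed in this answer
df <- data.frame(promo = c(rep(0, 4), rep(1, 2)),
                 left = c(0, 0, 0, 1, 0, 1))

# Print R code that recreates df; this is the text to paste into a question
dput(df)

# Round trip: capture the printed definition and evaluate it again
df_text <- paste(capture.output(dput(df)), collapse = "")
df_copy <- eval(parse(text = df_text))

# Compare the rebuilt object with the original
all.equal(df, df_copy)
```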

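Your problem isn't whether left is a factor. Your problem is that you are specifying stat = 'identity' in geom_bar(). stat = 'identity' is for data that is pre-aggregated, that is, when your data frame already holds the exact values you want to show in the plot. In this case, your data has individual 1s and 0s, not the total number of 1s and 0s, so stat = 'identity' is inappropriate.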
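To see why the two plots in the question can differ, here is a rough base R sketch using the example data assumed in this answer. With a numeric y and stat = 'identity', geom_bar() stacks the rows within each x group by default, so the bar height is the sum of left in that group. With as.factor(left), the same values become category labels rather than numbers to be added.

```r
# Example data, as assumed in this answer
df <- data.frame(promo = c(rep(0, 4), rep(1, 2)),
                 left = c(0, 0, 0, 1, 0, 1))

# Numeric y: stacked bar height per promo group is the sum of `left`
heights <- tapply(df$left, df$promo, sum)
heights

# Factor y: the values are now discrete labels, not quantities
left_levels <- levels(as.factor(df$left))
left_levels
```

Neither of these is the count or proportion of employees who quit, which is why neither plot from the question is a good basis for inference.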

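In fact, you shouldn't specify a y aesthetic at all, because you don't have a column of y values. Your left column holds individual values that need to be aggregated to get the y values, and geom_bar handles that aggregation when the stat is not 'identity'.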
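For contrast, this is a sketch of a case where stat = 'identity' would be appropriate: the counts are computed first, so the data frame already holds the bar heights. The name `counts` is an assumption made for illustration, and the Freq column is what as.data.frame() produces from a table.

```r
library(ggplot2)

# Example data, as assumed in this answer
df <- data.frame(promo = c(rep(0, 4), rep(1, 2)),
                 left = c(0, 0, 0, 1, 0, 1))

# Pre-aggregate: one row per promo/left combination, with its count in Freq
counts <- as.data.frame(table(promo = df$promo, left = df$left))
counts

# Now y = Freq is a real column of bar heights, so 'identity' fits
p <- ggplot(counts, aes(x = promo, y = Freq, fill = left)) +
  geom_bar(stat = "identity")
p
```

This should draw the same bars as letting geom_bar() do the counting itself. It is just more work.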

For counts, the plot is simple:

ggplot(df, aes(x = factor(promo), fill = factor(left))) +
           geom_bar()

To show each group as a proportion of its total instead, we can switch to position = 'fill':

ggplot(df, aes(x = factor(promo), fill = factor(left))) +
           geom_bar(position = 'fill')
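If you want the numbers behind the position = 'fill' plot, a small sketch with base R is below. The names `tab` and `props` are illustrative. prop.table() with margin = 1 divides each row (each promo group) by its own total.

```r
# Example data, as assumed in this answer
df <- data.frame(promo = c(rep(0, 4), rep(1, 2)),
                 left = c(0, 0, 0, 1, 0, 1))

# Cross-tabulate promotion status against leaving
tab <- table(promo = df$promo, left = df$left)

# Row-wise proportions: share of each promo group that stayed (0) or left (1)
props <- prop.table(tab, margin = 1)
props
#      left
# promo    0    1
#     0 0.75 0.25
#     1 0.50 0.50
```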

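If my assumptions about what your data looks like are wrong, please provide some sample data in your question. Data is best shared either with the code that creates it (as above) or via dput().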

