How to fill a histogram by different columns?

Question

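I am stuck on a project and would really appreciate your help. My goal is to explore the relationship between type (A, B, or C) and total income. I want to plot income in a histogram and fill the bars with color by type.
My raw data look like this: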

ID	year	income	type
x1	2015	300	A
x1	2015	700	C
x1	2016	1000	A
x1	2016	90	B
x1	2016	100	B
x2	2015	2000	A
x2	2015	150	B
x2	2015	500	C
x2	2015	45	C
x2	2016	100	B
x3	2015	111	C

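In this case, plotting income on the x axis and using aes(fill = type) fills the colors correctly. (The original post linked to the resulting histogram here.)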

library(ggplot2)
h <- ggplot(data, aes(fill = type, x = income))
h + geom_histogram()

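However, with the first table, each person's actual income for the year is lost, because every row is treated as a separate individual when I plot the histogram. For example, x1's 2015 income is assigned to the 300 and 700 bins, even though his total income that year was 1000. So, after summing the income received and counting the types used, I get the following table: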

ID	year	income_sum	typeA	typeB	typeC
x1	2015	1000	1	0	1
x1	2016	1190	1	2	0
x2	2015	2695	1	1	2
x2	2016	100	0	1	0
x3	2015	111	0	0	1
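
For reference, a table of this shape can be produced from the raw data roughly as follows. This is only a sketch: it assumes dplyr is available and that the raw data frame is named data, as in the plotting code above. The values are re-entered by hand from the first table.

```r
library(dplyr)

# Raw data, re-entered from the first table
data <- data.frame(
  ID     = c("x1", "x1", "x1", "x1", "x1", "x2", "x2", "x2", "x2", "x2", "x3"),
  year   = c(2015, 2015, 2016, 2016, 2016, 2015, 2015, 2015, 2015, 2016, 2015),
  income = c(300, 700, 1000, 90, 100, 2000, 150, 500, 45, 100, 111),
  type   = c("A", "C", "A", "B", "B", "A", "B", "C", "C", "B", "C")
)

# One row per ID and year: total income plus how often each type occurs
data2 <- data %>%
  group_by(ID, year) %>%
  summarise(
    income_sum = sum(income),
    typeA = sum(type == "A"),
    typeB = sum(type == "B"),
    typeC = sum(type == "C"),
    .groups = "drop"
  )
data2
```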

h <- ggplot(data2, aes(x=income_sum))
h+geom_histogram()

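This time the histogram represents total income accurately, but it is not filled with three different colors by type (A, B, C). (The original post linked to the resulting histogram here.)

Does anyone know how to solve this?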

Answer 1

Do you want something like the following?

library(dplyr)
library(ggplot2)
# One row per (ID, year, type); income is that ID's total for the year
data %>% group_by(ID, year) %>% summarize(income = sum(income), type = unique(type)) %>%
  ggplot(aes(fill = type, x = income)) + geom_histogram()

Note that after group_by and summarize, you have the following tibble:

   ID    year  income type 
  <chr> <chr>  <int> <chr>
1 x1    2015    1000 A    
2 x1    2015    1000 C    
3 x1    2016    1190 A    
4 x1    2016    1190 B    
5 x2    2015    2695 A    
6 x2    2015    2695 B    
7 x2    2015    2695 C    
8 x2    2016     100 B    
9 x3    2015     111 C
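
A side note on dplyr versions: in dplyr 1.1.0 and later, returning more than one row per group from summarise() is deprecated in favor of reframe(). The sketch below shows the same aggregation with reframe(). It assumes dplyr 1.1.0 or newer and re-enters the question's data by hand.

```r
library(dplyr)  # assumes dplyr >= 1.1.0, where reframe() was introduced

# Raw data, re-entered from the question
data <- data.frame(
  ID     = c("x1", "x1", "x1", "x1", "x1", "x2", "x2", "x2", "x2", "x2", "x3"),
  year   = c(2015, 2015, 2016, 2016, 2016, 2015, 2015, 2015, 2015, 2016, 2015),
  income = c(300, 700, 1000, 90, 100, 2000, 150, 500, 45, 100, 111),
  type   = c("A", "C", "A", "B", "B", "A", "B", "C", "C", "B", "C")
)

# reframe() may return any number of rows per group and drops the grouping
res <- data %>%
  group_by(ID, year) %>%
  reframe(income = sum(income), type = unique(type))
res
```

The result can be piped into ggplot() in the same way as the summarize() version.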

[EDIT]

If you want the bar heights to be proportional to the number of occurrences, the following should work:

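# Step 1: per ID, year, and type, sum the income and count the rows
df <- data %>% group_by(ID, year, type) %>% 
               summarise(income = sum(income), count = n()) %>% 
               # Step 2: per ID and year, replace income with the yearly total,
               # keeping one row per type
               group_by(ID, year) %>% 
               summarize(income = sum(income), type = type, count = count)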
df

# A tibble: 9 x 5
# Groups:   ID, year [5]
  ID    year  income type  count
  <chr> <chr>  <int> <chr> <int>
1 x1    2015    1000 A         1
2 x1    2015    1000 C         1
3 x1    2016    1190 A         1
4 x1    2016    1190 B         2
5 x2    2015    2695 A         1
6 x2    2015    2695 B         1
7 x2    2015    2695 C         2
8 x2    2016     100 B         1
9 x3    2015     111 C         1

df %>% ggplot(aes(fill=type, color=type, x=income, y=count)) + 
  geom_bar(stat='identity', width = 50, alpha=0.5)

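Note that there is one difference. Because the values 100 and 111 are not exactly equal (unlike the other values, which coincide), the B and C bars at these values are not stacked on top of each other. Instead they overlap (one is centered at 100, the other at 111).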
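
The extent of that overlap can be worked out by hand, since geom_bar centers each bar on its x value. The following base R sketch does the arithmetic for the two bars in question. The variable names are only illustrative.

```r
# Bar centers for the two rows that do not share an x value
centers <- c(B = 100, C = 111)
width <- 50  # same width as passed to geom_bar above

# Each bar spans center - width/2 to center + width/2
extents <- data.frame(
  type = names(centers),
  xmin = centers - width / 2,
  xmax = centers + width / 2
)
extents

# Length of the x range covered by both bars: 125 - 86
overlap <- min(extents$xmax) - max(extents$xmin)
overlap  # 39
```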

[EDIT2]

We also need binning to achieve what you want (the bin width is currently set to 50; change it if needed):

bins <- seq(min(df$income), max(df$income), 50)
df$bin <- sapply(df$income, function(x) max(which(bins <= x)))

df <- df %>% group_by(bin) %>%
  mutate(income = mean(income))  # bin is the grouping column, so it is carried along unchanged

df

    ID    year  income type  count   bin
    <chr> <chr>  <dbl> <chr> <int> <int>
  1 x1    2015   1000  A         1    19
  2 x1    2015   1000  C         1    19
  3 x1    2016   1190  A         1    22
  4 x1    2016   1190  B         2    22
  5 x2    2015   2695  A         1    52
  6 x2    2015   2695  B         1    52
  7 x2    2015   2695  C         2    52
  8 x2    2016    106. B         1     1
  9 x3    2015    106. C         1     1
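
The bin indices in this output can be checked with base R alone. As a sketch, the snippet below recomputes them for the five distinct income totals and compares the sapply() approach with findInterval(), which returns the same index for values at or above the first break.

```r
# The five distinct yearly totals from the table above
income <- c(1000, 1190, 2695, 100, 111)

# Breaks every 50 units, starting at the smallest total
bins <- seq(min(income), max(income), 50)

# Index of the last break that is <= each value
idx_sapply <- sapply(income, function(x) max(which(bins <= x)))

# findInterval() gives the same index in a single vectorised call
idx_fi <- findInterval(income, bins)

idx_fi        # 19 22 52 1 1
bins[idx_fi]  # lower edge of each bin: 1000 1150 2650 100 100
```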

df %>% 
  ggplot(aes(fill=type, color=type, x=income, y=count)) + 
  geom_bar(stat='identity', width = 50)

Note that each bin's income is set to the mean of the data points within that bin.
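
An alternative that may be worth trying, which is not part of the steps above, is to let geom_histogram do the binning. The idea is to attach the yearly total to every raw row with mutate(), so each row still carries its own type but is placed at the total income of its ID and year. This is a sketch under the assumption that counting raw rows per type is what is wanted. The column name total is made up for illustration.

```r
library(dplyr)
library(ggplot2)

# Raw data, re-entered from the question
data <- data.frame(
  ID     = c("x1", "x1", "x1", "x1", "x1", "x2", "x2", "x2", "x2", "x2", "x3"),
  year   = c(2015, 2015, 2016, 2016, 2016, 2015, 2015, 2015, 2015, 2016, 2015),
  income = c(300, 700, 1000, 90, 100, 2000, 150, 500, 45, 100, 111),
  type   = c("A", "C", "A", "B", "B", "A", "B", "C", "C", "B", "C")
)

# Keep all 11 rows, but give each one the total income of its ID and year
df_total <- data %>%
  group_by(ID, year) %>%
  mutate(total = sum(income)) %>%
  ungroup()

# Each row adds 1 to the bar at its yearly total, stacked by type
p <- ggplot(df_total, aes(x = total, fill = type)) +
  geom_histogram(binwidth = 50)
p
```

Here binwidth = 50 mirrors the bar width used above. The bin edges are chosen by ggplot2, so they need not match the manual bins exactly.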
