Data Visualization Clarification in R for a density / histogram plot

Question

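I'm working with the Kickstarter Dataset from Kaggle, and I'd like to use ggplot to create a meaningful visualization of the project data around the pledged ratio. This is a field I added, computed by dividing each project's pledged amount in USD by its goal amount in USD.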

To reproduce the dataset I'm using in R, run the following code:

if(!require(tidyverse)) install.packages("tidyverse", repos = "http://cran.us.r-project.org")
if(!require(ggplot2)) install.packages("ggplot2", repos = "http://cran.us.r-project.org")
if(!require(dplyr)) install.packages("dplyr", repos = "http://cran.us.r-project.org")

library(tidyverse)
library(ggplot2)
library(dplyr)

file_path <- "https://raw.githubusercontent.com/mganopolsky/kickstarter/master/data/ks-projects-201801.csv"
data <- read_csv(file_path)

# Drop the `usd pledged` column (usd_pledged_real is used below)
ds <- data %>% dplyr::select(-`usd pledged`)

# Add derived fields: campaign length in days, pledged ratio, average pledge per backer
ds <- ds %>% mutate(time_int = as.numeric(deadline - as.Date(launched)),
                    launched = as.Date(launched),
                    pledged_ratio = round(usd_pledged_real / usd_goal_real, 2),
                    avg_backer_pldg = ifelse(backers == 0, 0, round(usd_pledged_real / backers))) %>%
  mutate(launched_month = as.factor(format(launched, "%m")),
         launched_day_of_week = as.factor(format(launched, "%u")),
         currency = as.factor(currency),
         launched_year = as.factor(format(launched, "%Y")))

# Keep only projects launched on or after 2009-04-21
ds <- ds %>% filter(launched >= as.Date("2009-04-21"))

At this point, I'd like to see visually what kinds of pledged_ratio values occur across projects. The data can be inspected with the following code:

ds %>% filter(state=="successful" ) %>% group_by(pledged_ratio) %>% summarise( pledged_ratio_count = n()) %>%
  arrange(desc(pledged_ratio))

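This shows how many projects fall at each specific ratio, but the raw counts aren't very meaningful. Some kind of binned display would be preferable, for example using geom_histogram(), or even geom_density().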
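
As a rough illustration of what "binned" could mean here, the sketch below groups a handful of made-up ratios into bands with base R's cut(). The `ratios` vector and the break points are assumptions chosen for the example, not values taken from the Kickstarter data, and the breaks start at 1 on the assumption that successful projects reached their goal.

```r
# Made-up pledged ratios for illustration only (not from the Kickstarter data)
ratios <- c(1.00, 1.02, 1.05, 1.10, 1.30, 1.80, 2.50, 4.00, 12.00, 150.00)

# Hypothetical break points; right = FALSE makes each bin closed on the left, e.g. [1,1.1)
breaks <- c(1, 1.1, 1.5, 2, 5, 10, Inf)
bins <- cut(ratios, breaks = breaks, right = FALSE)

# Count how many ratios land in each bin (empty bins are kept with a count of 0)
bin_counts <- table(bins)
print(bin_counts)
```

Uneven bands like these keep the crowded region near 1 readable while still giving the long tail its own bins. Any value below the first break would come out as NA.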

When I run a density plot, this is the result:

ds %>% filter(state=="successful" ) %>% 
  arrange(desc(pledged_ratio))  %>% ggplot(aes(pledged_ratio)) + geom_density() + 
  ggtitle("Density Distribution of Pledge Ratios for Successful Projects") + xlab("Pledge Ratios")

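Once you stare at it for a while, it makes sense, since most projects are funded at around 100%, i.e. a ratio of 1. However, some projects are funded at much higher ratios, and I'd like a visualization that shows this in a way that isn't meaningless.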
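
To make the skew concrete, here is a small sketch on made-up numbers (the `ratios` vector is an assumption for illustration, not real project data). It compares the median with the maximum, which is the kind of gap that squeezes a density plot against the left edge.

```r
# Made-up pledged ratios for illustration only (not from the Kickstarter data)
ratios <- c(1.00, 1.02, 1.05, 1.10, 1.30, 1.80, 2.50, 4.00, 12.00, 150.00)

# The typical project sits close to 1 ...
median_ratio <- median(ratios)

# ... while the largest value is orders of magnitude bigger and sets the x-axis range
max_ratio <- max(ratios)

# Share of projects funded at no more than 1.5x their goal
share_near_goal <- mean(ratios <= 1.5)

print(c(median = median_ratio, max = max_ratio, share_near_goal = share_near_goal))
```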

I've tried it with a histogram:

ds %>% filter(state=="successful" ) %>% 
  arrange(desc(pledged_ratio))  %>% ggplot(aes(pledged_ratio)) + geom_histogram(bins = 20)

This produced another somewhat meaningless histogram:

Finally, using geom_point() I got this:

ds %>% filter(state=="successful" ) %>% group_by(pledged_ratio) %>% summarise( pledged_ratio_count = n()) %>%
  arrange(desc(pledged_ratio))  %>% ggplot(aes(pledged_ratio, y=pledged_ratio_count)) + geom_point()

This is probably the most insightful plot so far:

However, I still believe there must be a better way to convey what the data is saying. Any suggestions would be greatly appreciated.

Answer 1

How about an empirical CDF?

library(scales)
ds %>% filter(state=="successful") %>% 
  ggplot(aes(x=pledged_ratio)) + 
  stat_ecdf() + 
  scale_x_continuous(trans="pseudo_log", breaks = c(10, 100, 1000, 10000, 100000), labels=comma) + 
  scale_y_continuous(labels=percent) + 
  theme_bw() + 
  labs(x="Pledged Ratio", y="Percentage of Projects")
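
For anyone unsure how to read such a plot: each point on the curve gives the share of projects whose ratio is at or below that x value. A minimal sketch with base R's ecdf() on made-up numbers (the `ratios` vector is an assumption for illustration, not real project data) shows the same idea without plotting.

```r
# Made-up pledged ratios for illustration only (not from the Kickstarter data)
ratios <- c(1.00, 1.02, 1.05, 1.10, 1.30, 1.80, 2.50, 4.00, 12.00, 150.00)

# ecdf() returns a step function; calling it at x gives the share of values <= x
ratio_ecdf <- ecdf(ratios)

# Share of projects funded at no more than 2x and no more than 10x their goal
share_le_2  <- ratio_ecdf(2)
share_le_10 <- ratio_ecdf(10)

print(c(share_le_2 = share_le_2, share_le_10 = share_le_10))
```

Because the curve only ever rises from 0 to 1, the extreme ratios stretch the x-axis but cannot flatten the rest of the picture, which is presumably why the x-axis above is put on a pseudo-log scale.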
