ggplot: how to "correct" an unrepresentative spike in the plot
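I have percentage-score data with datetimes (date plus hours:minutes:seconds). I want to graphically "correct" or highlight a data point that is not representative.
Background
I have data on how people rate their happiness level each day, on a continuous scale running from 0 to 1, where 0 means "extremely unhappy" and 1 means "extremely happy". I asked many people, and I want to get a sense of the "group happiness" over time.
Data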
library(tidyverse)
library(lubridate)
set.seed(1234)
original_df <-
seq(as.POSIXct('2020-09-01', tz = "UTC"), as.POSIXct('2020-09-15', tz = "UTC"), by="1 mins") %>%
sample(15000, replace = T) %>%
as_tibble %>%
rename(date_time = value) %>%
mutate(date = date(date_time)) %>%
add_column(score = runif(15000))
original_df
## # A tibble: 15,000 x 3
## date_time date score
## <dttm> <date> <dbl>
## 1 2020-09-06 04:11:00 2020-09-06 0.683
## 2 2020-09-06 13:35:00 2020-09-06 0.931
## 3 2020-09-05 23:21:00 2020-09-05 0.121
## 4 2020-09-06 14:45:00 2020-09-06 0.144
## 5 2020-09-07 09:15:00 2020-09-07 0.412
## 6 2020-09-01 10:22:00 2020-09-01 0.564
## 7 2020-09-11 14:00:00 2020-09-11 0.960
## 8 2020-09-08 13:24:00 2020-09-08 0.845
## 9 2020-09-01 15:33:00 2020-09-01 0.225
## 10 2020-09-09 19:27:00 2020-09-09 0.815
## # ... with 14,990 more rows
However, it turns out that one of the days happens to have far fewer data points. So the actual dataset looks like this:
actual_df <-
original_df %>%
filter(date %in% as_date("2020-09-10")) %>%
group_by(date) %>%
slice_sample(n = 15) %>%
ungroup %>%
bind_rows(original_df %>% filter(!date %in% as_date("2020-09-10")))
> actual_df %>% count(date)
## # A tibble: 14 x 2
## date n
## <date> <int>
## 1 2020-09-01 1073
## 2 2020-09-02 1079
## 3 2020-09-03 1118
## 4 2020-09-04 1036
## 5 2020-09-05 1025
## 6 2020-09-06 1089
## 7 2020-09-07 1040
## 8 2020-09-08 1186
## 9 2020-09-09 1098
## 10 2020-09-10 15 ## <- this day has less data
## 11 2020-09-11 1095
## 12 2020-09-12 1051
## 13 2020-09-13 1037
## 14 2020-09-14 1034
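Plotting this data over time
What I have been doing so far relies on means
I treat each day as a factor and estimate the daily mean. Statistically, this solution may be far from ideal, as @BrianLang commented below. For now, though, this is the method I have chosen.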
library(emmeans)
model_fit <-
actual_df %>%
mutate(across(date, factor)) %>%
lm(score ~ date, data = .)
emmeans_fit_data <- emmeans(model_fit, ~ date, CIs = TRUE)
emmeans_fit_data %>%
as_tibble %>%
ggplot(data = ., aes(x = date, y = emmean)) +
geom_line(color = "#1a476f", group = 1, lwd = 1) +
geom_errorbar(aes(ymin = lower.CL, ymax = upper.CL), alpha = 0.5, color = "#90353b", width = 0.2) +
## Constant colours belong outside aes() and need a leading "#" to be valid hex codes.
geom_text(aes(label = paste0(round(100*emmean, 1), "%")), color = "#90353b", vjust = -4, hjust = 0.5, size = 3.5) +
geom_point(color = "#1a476f") +
scale_y_continuous(labels = function(x) paste0(100*x, "%")) +
ylab("Level of Happiness") +
xlab("Date") +
ggtitle("Mood Over Time") +
theme(plot.title = element_text(hjust = 0.5, size = 14),
axis.text.x=element_text(angle = -60, hjust = 0),
axis.title.x = element_blank(),
legend.title = element_blank(),
plot.caption = element_text(hjust = 0, size = 8),
legend.position = "none")
But then I get this spike on 2020-09-10, which is there only because of the small number of data points on that day.
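To put a rough number on why 15 responses can produce a spike: for scores drawn uniformly from 0 to 1, as in the simulated data, the standard error of a daily mean shrinks with the square root of the number of responses. The following is a minimal sketch under that assumption; the helper `se_of_mean` is a hypothetical name introduced only for illustration.

```r
# Standard deviation of a Uniform(0, 1) variable is sqrt(1/12)
sd_unif <- sqrt(1 / 12)

# Hypothetical helper: standard error of the mean of n independent scores
se_of_mean <- function(n) sd_unif / sqrt(n)

se_sparse <- se_of_mean(15)    # the day with only 15 responses
se_usual  <- se_of_mean(1000)  # a typical day with about 1000 responses

# How many times noisier the sparse day's mean is than a typical day's
ratio <- se_sparse / se_usual
print(c(se_sparse = se_sparse, se_usual = se_usual, ratio = ratio))
```

With real data the spread of scores will differ from the uniform case, but the square-root relationship between sample size and noise still holds for independent responses.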
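One graphical solution would be to break the line at the problematic day and "complete" it the way it would look if there had been enough data points. Perhaps based on the mean of the day before and the day after? I don't want to delete real data, but I do want to emphasize graphically that this value is not representative and that the actual value should be closer to those of the day before and the day after. I think a dashed line is a reasonable graphical solution.
Otherwise, I would hope there is a different way of modeling/plotting this kind of "over time" data using ggplot's smoothing, which would give me a smoother trend line and a confidence ribbon that accounts for the problematic day. I know this may be beyond the scope of this question, so I am only adding it as a side note, in case someone wants to propose a solution based on different modeling rather than just a graphical correction. I would be grateful for either.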
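The dashed-line idea described in the question could be sketched roughly as follows. This is only one possible reading of it, not a method taken from the answer: the `daily` table is a made-up stand-in for a per-day summary of `actual_df`, and `min_n`, `sparse`, `neighbour_mean`, `display` and `dashed` are hypothetical names. The real mean of the sparse day is kept on the plot as a hollow point so that no data is hidden.

```r
library(dplyr)
library(ggplot2)

set.seed(1234)
# Made-up daily summary: 14 days, one of them with only 15 responses
daily <- tibble(
  date       = seq(as.Date("2020-09-01"), as.Date("2020-09-14"), by = "day"),
  n          = c(rep(1000L, 9), 15L, rep(1000L, 4)),
  mean_score = c(runif(9, 0.48, 0.52), 0.65, runif(4, 0.48, 0.52))
)

min_n <- 100  # assumed threshold below which a day counts as sparse

daily <- daily %>%
  mutate(
    sparse         = n < min_n,
    # mean of the previous and next day (NA if the sparse day is first or last)
    neighbour_mean = (lag(mean_score) + lead(mean_score)) / 2,
    # value the line passes through: neighbour mean on sparse days, real mean otherwise
    display        = if_else(sparse, neighbour_mean, mean_score),
    # the segment starting at a day is dashed if either of its ends is sparse
    dashed         = sparse | lead(sparse, default = FALSE)
  )

# One row per line segment between consecutive days
segments <- daily %>%
  mutate(xend = lead(date), yend = lead(display)) %>%
  filter(!is.na(xend))

p <- ggplot(daily, aes(date, display)) +
  geom_segment(data = segments, aes(xend = xend, yend = yend, linetype = dashed)) +
  geom_point(data = filter(daily, !sparse)) +
  # the real, unrepresentative mean stays visible as a hollow point
  geom_point(data = filter(daily, sparse), aes(y = mean_score), shape = 1, size = 3) +
  scale_linetype_manual(values = c(`FALSE` = "solid", `TRUE` = "dashed"), guide = "none")

print(p)
```

Whether a neighbour average is a defensible replacement depends on the data; the sketch only shows how the drawing could be done.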
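Without going as far as time-series models, you could transform the time variable with restricted cubic splines.
I needed to change some of your code so that I could avoid installing the latest versions of some packages ;-).
Note that I changed some variable names, because date is a function name and should not also be used as a variable name.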
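library(chron)
## Assumption: the renamed day-level column datez is not defined in the code shown,
## so it is derived here from date_time to make the block runnable.
original_df <- original_df %>%
mutate(datez = lubridate::date(date_time))
## added a numeric version of your date variable.
actual_df <- original_df %>%
filter(datez %in% lubridate::date("2020-09-10")) %>%
sample_n(size = 15) %>%
bind_rows(original_df %>% filter(!datez %in% lubridate::date("2020-09-10"))) %>%
mutate(num_date = as.numeric(datez))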
## How many knots across the dates do you want?
number_of_knots = 15
## This is to make sure that visreg is passed the actual knot locations! rms::rcs() does not store them in the model fits.
knots <- paste0("c(", paste0(attr(rms::rcs(actual_df$num_date, number_of_knots), "parms"), collapse = ", "), ")")
## We can construct the formula early.
formula <- as.formula(paste("score ~ rms::rcs(num_date,", knots,")"))
## Fit the model as a Gaussian glm and pass it to visreg for its prediction function. This will give you predicted means and a 95% CI for each mean. Then I convert the numeric dates back to real dates.
glm_rcs <- glm(data = actual_df, formula = formula, family = "gaussian") %>% visreg::visreg(plot = F) %>% .$fit %>%
mutate(date_date = chron::as.chron(num_date) %>% as.POSIXct())
## plot it!
ggplot(data = glm_rcs, aes(date_date,
y = visregFit)) +
geom_ribbon(aes(ymin = visregLwr, ymax = visregUpr), alpha = .5) +
geom_line()
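For readers who would rather not install rms and visreg, a similar kind of fit can be sketched with the splines package that ships with R. This is a swapped-in technique, not the code above: `splines::ns()` builds a natural cubic spline basis, which belongs to the same family as restricted cubic splines, but its knot placement and basis differ from `rms::rcs()`, so the curve will not match exactly. The data frame `dat` is a simulated stand-in, and `df = 5` is an arbitrary choice.

```r
library(splines)
library(ggplot2)

set.seed(1234)
# Simulated stand-in: 5000 datetimes spread over two weeks, uniform scores
dat <- data.frame(
  date_time = as.POSIXct("2020-09-01", tz = "UTC") + runif(5000, 0, 14 * 86400),
  score     = runif(5000)
)
dat$num_date <- as.numeric(dat$date_time)

# Linear model on a natural cubic spline basis of time
fit <- lm(score ~ ns(num_date, df = 5), data = dat)

# Predict the mean on an even grid, with standard errors
grid <- data.frame(
  num_date = seq(min(dat$num_date), max(dat$num_date), length.out = 200)
)
pred <- predict(fit, newdata = grid, se.fit = TRUE)

# Pointwise 95% confidence interval for the predicted mean
crit     <- qt(0.975, df = fit$df.residual)
grid$fit <- pred$fit
grid$lwr <- pred$fit - crit * pred$se.fit
grid$upr <- pred$fit + crit * pred$se.fit

# Convert the numeric time back to a datetime for plotting
grid$date_time <- as.POSIXct(grid$num_date, origin = "1970-01-01", tz = "UTC")

p <- ggplot(grid, aes(date_time, fit)) +
  geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.5) +
  geom_line()

print(p)
```

Raising `df` makes the curve more flexible, and lowering it smooths more, which plays the same role as the number of knots in the code above.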
Edit: You collect your data by day, but you can add jitter to the dates so that they are spread out across the day.
actual_df <- original_df %>%
filter(datez %in% lubridate::date("2020-09-10")) %>%
sample_n(size = 15) %>%
group_by(datez) %>%
ungroup %>%
bind_rows(original_df %>% filter(!datez %in% lubridate::date("2020-09-10"))) %>%
mutate(num_date = as.numeric(datez)) %>%
## Here we add random noise (uniform -.5 to .5) to each numeric date.
mutate(jittered_date = num_date + runif(n(), -.5, .5))
## You can lower this number to increase smoothing.
number_of_knots = 15
knots <- paste0("c(", paste0(attr(rms::rcs(actual_df$jittered_date, number_of_knots), "parms"), collapse = ", "), ")")
formula <- as.formula(paste("score ~ rms::rcs(jittered_date,", knots,")"))
glm_rcs <- glm(data = actual_df, formula = formula, family = "gaussian") %>% visreg::visreg(plot = F) %>% .$fit %>%
mutate(date_date = chron::as.chron(jittered_date) %>% as.POSIXct())
ggplot(data = glm_rcs, aes(date_date,
y = visregFit)) +
geom_ribbon(aes(ymin = visregLwr, ymax = visregUpr), alpha = .5) +
geom_line()
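One way to see why jitter can help: with only 14 distinct day values, 15 knots placed at quantiles of the predictor cannot all be different, whereas jittered values are effectively continuous. The sketch below uses base R only. The quantile probabilities are an assumption chosen for illustration and are not claimed to be the ones `rms::rcs()` uses internally.

```r
set.seed(1)

# 14 days as numeric dates (days since 1970-01-01), 100 responses per day
days     <- as.numeric(seq(as.Date("2020-09-01"), as.Date("2020-09-14"), by = "day"))
num_date <- rep(days, each = 100)

# Same jitter as above: uniform noise between -0.5 and 0.5 days
jittered <- num_date + runif(length(num_date), -0.5, 0.5)

# Assumed knot rule for illustration: 15 evenly spaced quantiles
probs     <- seq(0.025, 0.975, length.out = 15)
knots_raw <- quantile(num_date, probs = probs)
knots_jit <- quantile(jittered, probs = probs)

# Number of distinct knot locations with and without jitter
n_unique_raw <- length(unique(knots_raw))
n_unique_jit <- length(unique(knots_jit))
print(c(raw = n_unique_raw, jittered = n_unique_jit))
```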
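Edit 2:
If you have a datetime vector rather than just a day, then jittering the points is not necessary.
In your original code that creates the fake data, you use lubridate::date(), which takes a POSIX datetime vector and strips it down to a plain date! You can avoid this as follows: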
original_df <- tibble(datez = seq(as.POSIXct('2020-09-01', tz = "UTC"), as.POSIXct('2020-09-15', tz = "UTC"), by="1 mins") %>%
sample(15000, replace = T)) %>%
mutate(datez_day = lubridate::date(datez)) %>%
add_column(score = runif(15000))
actual_df <- original_df %>%
filter(datez_day %in% lubridate::date("2020-09-10")) %>%
sample_n(size = 15) %>%
bind_rows(original_df %>% filter(!datez_day %in% lubridate::date("2020-09-10"))) %>%
mutate(num_date = as.numeric(datez))
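Now you have datez_day, which is the day value; datez, which is the datetime; and num_date, which is the numeric representation of the datetime (seconds since 1970-01-01 UTC).
From there you can model directly on num_date without adding any jitter.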
number_of_knots = 20
knots <- paste0("c(", paste0(attr(rms::rcs(actual_df$num_date, number_of_knots), "parms"), collapse = ", "), ")")
formula <- as.formula(paste("score ~ rms::rcs(num_date,", knots,")"))
glm_rcs <- glm(data = actual_df, formula = formula, family = "gaussian") %>%
visreg::visreg(plot = F) %>%
.$fit %>%
as_tibble() %>%
## Translate the num_date back into a datetime object so it is correct in the figures!
mutate(date_date = as.POSIXct(round(num_date), origin = "1970-01-01", tz = "UTC"))
ggplot(data = glm_rcs, aes(date_date,
y = visregFit)) +
geom_ribbon(aes(ymin = visregLwr, ymax = visregUpr), alpha = .5) +
geom_line()
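The conversion between a datetime and its numeric form, used in the last step, can be checked in isolation. This is a small sketch with a made-up timestamp; passing `tz = "UTC"` is an assumption that matches the time zone used when the fake data were created, and it keeps the printed clock time from shifting to the local zone.

```r
# A made-up datetime in UTC and its numeric form (seconds since 1970-01-01 00:00:00 UTC)
dt  <- as.POSIXct("2020-09-10 13:24:00", tz = "UTC")
num <- as.numeric(dt)

# Back to a datetime, using the same origin and time zone
back <- as.POSIXct(num, origin = "1970-01-01", tz = "UTC")

# The round trip preserves both the underlying number and the printed time
same_number <- as.numeric(back) == num
printed     <- format(back, "%Y-%m-%d %H:%M:%S")
print(same_number)
print(printed)

# A plain Date, by contrast, converts to days since 1970-01-01, not seconds
day_num <- as.numeric(as.Date("2020-09-10"))
print(day_num)
```

This difference in units is why the day-level code above went through chron, while the datetime version can use as.POSIXct() directly.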