Create a table with values from ecdf graph
I am trying to create a table using the values from an ecdf graph. I have recreated an example below.
# Packages used below
library(dplyr)
library(ggplot2)
# Data
data(mtcars)
# Sort by mpg
mtcars <- mtcars[order(mtcars$mpg), ]
# Make arbitrary ranking variable based on mpg
mtcars <- mtcars %>% mutate(Rank = dense_rank(mpg))
# Make variable for percent picked
mtcars <- mutate(mtcars, Percent_Picked = Rank / max(mtcars$Rank))
# Make cyl categorical
mtcars$cyl <- cut(mtcars$cyl, c(3, 5, 7, 9), right = FALSE, labels = c(4, 6, 8))
# Make the graph
ggplot(mtcars, aes(Percent_Picked, color = cyl)) +
  stat_ecdf(size = 1) +
  scale_x_continuous(labels = scales::percent) +
  scale_y_continuous(labels = scales::percent)
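This creates the ecdf graph, with one curve per cylinder type.
I would like to create a table of the values for each cylinder type when the overall Percent_Picked is at 25%, 50%, and 75%. So something that shows 4-cylinder at 0%, 6 at around 28%, and 8 at around 85%.
Calculating quantiles by group does not give me what I am looking for (it shows the percent of all cylinders picked when 25%, 50%, and 75% of a particular cylinder type has been picked). (For example, the suggestion by tbradley1013 on their blog only helps with the quantiles of each specific cylinder type, not the overall cdf of each cylinder type at a given quantile of Percent_Picked.)
Any leads would be appreciated!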
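Looking around, I found a starting point for this. Yours extends it a little by asking for group-specific ecdf values, so we can use the do function in dplyr to do this. When comparing this table with the values in your ggplot, there are some slight differences, and I am not sure why that is. It may just be that the mtcars dataset is a bit small, so if you run this on a larger dataset, I would expect it to be closer to the actual values.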
# Sort by mpg
mtcars <- mtcars[order(mtcars$mpg), ]
# Make arbitrary ranking variable based on mpg
mtcars <- mtcars %>% mutate(Rank = dense_rank(mpg))
# Make variable for percent picked
mtcars <- mutate(mtcars, Percent_Picked = Rank / max(mtcars$Rank))
# Make cyl categorical
mtcars$cyl <- cut(mtcars$cyl, c(3, 5, 7, 9), right = FALSE, labels = c(4, 6, 8))
# Make the graph
ggplot(mtcars, aes(Percent_Picked, color = cyl)) +
  stat_ecdf(size = 1) +
  scale_x_continuous(labels = scales::percent) +
  scale_y_continuous(labels = scales::percent)
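# Build the step points of one group's ecdf, rescaled to run from 0 to 1
create_ecdf_vals <- function(vec){
  df <- data.frame(
    x = unique(vec),
    y = ecdf(vec)(unique(vec)) * length(vec)
  ) %>%
    # scale() returns a one-column matrix, so convert it back to a plain vector
    mutate(y = as.numeric(scale(y, center = min(y), scale = diff(range(y))))) %>%
    union_all(data.frame(x = c(0, 1),
                         y = c(0, 1))) # adding in max/mins
  return(df)
}
# Apply the helper to each cylinder group
mt.ecdf <- mtcars %>%
  group_by(cyl) %>%
  do(create_ecdf_vals(.$Percent_Picked))
# y at the largest x that does not exceed each threshold
mt.ecdf %>%
  summarise(q25 = y[which.max(x[x <= 0.25])],
            q50 = y[which.max(x[x <= 0.5])],
            q75 = y[which.max(x[x <= 0.75])])
# Plot the reconstructed step functions for comparison with stat_ecdf
ggplot(mt.ecdf, aes(x, y, color = cyl)) +
  geom_step()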
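One possible contributor to the slight differences mentioned above is the rescaling inside create_ecdf_vals, which forces each group's smallest step to 0, whereas an ecdf is already positive at its first data point. As a minimal sketch (the names cars and ecdf_at are illustrative and not part of the answer above), base R's ecdf() can be evaluated directly at the three cut-offs without any rescaling:

```r
library(dplyr)

# Rebuild Percent_Picked from a fresh copy of mtcars so the sketch runs on its own
cars <- datasets::mtcars %>%
  mutate(Percent_Picked = dense_rank(mpg) / max(dense_rank(mpg)),
         cyl = factor(cyl))

# ecdf(v) returns a function; calling it at t gives the share of v that is <= t
ecdf_at <- cars %>%
  group_by(cyl) %>%
  summarise(q25 = ecdf(Percent_Picked)(0.25),
            q50 = ecdf(Percent_Picked)(0.50),
            q75 = ecdf(Percent_Picked)(0.75))

ecdf_at
```

For this data the 50% column comes out as 0 for 4 cylinders, roughly 0.29 for 6 and roughly 0.86 for 8, in line with the figures quoted in the question.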
~EDIT~
After some digging in the ggplot2 documentation, we can actually use the layer_data function to explicitly extract the data from the plot.
my.plt <- ggplot(mtcars, aes(Percent_Picked, color = cyl)) +
  stat_ecdf(size = 1) +
  scale_x_continuous(labels = scales::percent) +
  scale_y_continuous(labels = scales::percent)
plt.data <- layer_data(my.plt) # magic happens here
# and here's the table you want
plt.data %>%
  group_by(group) %>%
  summarise(q25 = y[which.max(x[x <= 0.25])],
            q50 = y[which.max(x[x <= 0.5])],
            q75 = y[which.max(x[x <= 0.75])])
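The table above is keyed by ggplot2's integer group column rather than by cylinder type. A hedged sketch of one way to attach the labels follows. It assumes that, with a single factor mapped to color, the group ids follow the order of the factor levels. The names cars, p and labelled are illustrative.

```r
library(dplyr)
library(ggplot2)

# Rebuild the plotting data from a fresh copy of mtcars
cars <- datasets::mtcars %>%
  mutate(Percent_Picked = dense_rank(mpg) / max(dense_rank(mpg)),
         cyl = factor(cyl))

p <- ggplot(cars, aes(Percent_Picked, color = cyl)) + stat_ecdf()

labelled <- layer_data(p) %>%
  # Assumption: group id i corresponds to the i-th level of cyl
  mutate(cyl = levels(cars$cyl)[group]) %>%
  group_by(cyl) %>%
  # The ecdf never decreases, so the largest y with x <= t is its value at t;
  # stat_ecdf pads each curve with a point at -Inf (y = 0), so no subset is empty
  summarise(q25 = max(y[x <= 0.25]),
            q50 = max(y[x <= 0.50]),
            q75 = max(y[x <= 0.75]))

labelled
```

Using max(y[x <= t]) instead of indexing with which.max avoids relying on the row order of the extracted data.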
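A much shorter answer that I can't believe I didn't see before. Essentially, for each cylinder type, I just divide the number of rows at or below .25, .5, and .75 by the total number of rows.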
cyl.table <- mtcars %>%
  group_by(cyl) %>%
  summarise("25% Picked" = sum(Percent_Picked <= 0.25) / (sum(Percent_Picked <= 1)),
            "50% Picked" = sum(Percent_Picked <= 0.5) / (sum(Percent_Picked <= 1)),
            "75% Picked" = sum(Percent_Picked <= 0.75) / (sum(Percent_Picked <= 1)))
cyl.table
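The same counting idea extends to any set of cut-offs, since the mean of a logical vector is the share of TRUE values. The following is a rough base R sketch, not the code from the answer above. The names cars, rank_mpg, thresholds and tab are illustrative, and match() on the sorted unique values stands in for dense_rank.

```r
cars <- datasets::mtcars

# Dense rank of mpg: position of each value among the sorted distinct values
rank_mpg <- match(cars$mpg, sort(unique(cars$mpg)))
cars$Percent_Picked <- rank_mpg / max(rank_mpg)

thresholds <- c(0.25, 0.50, 0.75)

# For each threshold, the share of rows at or below it within each cyl group
tab <- sapply(thresholds, function(t) {
  tapply(cars$Percent_Picked <= t, cars$cyl, mean)
})
colnames(tab) <- paste0(thresholds * 100, "% Picked")

round(tab, 3)
```

Adding another cut-off only requires extending thresholds, and the rows of tab are labelled by cylinder count.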