How to summarize the top 3 highest values in a dataset when there are ties
I have a data frame (my_data) and I want to calculate the sum of only the 3 highest values, even though there may be ties. I am new to R and have been using dplyr.
# A tibble: 15 x 3
   city      month number
   <chr>     <chr>  <dbl>
 1 Lund      jan       12
 2 Lund      feb       12
 3 Lund      mar       18
 4 Lund      apr       28
 5 Lund      may       28
 6 Stockholm jan       15
 7 Stockholm feb       15
 8 Stockholm mar       30
 9 Stockholm apr       30
10 Stockholm may       10
11 Uppsala   jan       22
12 Uppsala   feb       30
13 Uppsala   mar       40
14 Uppsala   apr       60
15 Uppsala   may       30
This is the code I tried:
# For each city, count the top 3 of variable number
my_data %>% group_by(city) %>% top_n(3, number) %>% summarise(top_nr = sum(number))
The expected (desired) output is:
# A tibble: 3 x 2
city top_nr
<chr> <dbl>
1 Lund 86
2 Stockholm 75
3 Uppsala 130
But the actual R output is:
# A tibble: 3 x 2
city top_nr
<chr> <dbl>
1 Lund 86
2 Stockholm 90
3 Uppsala 160
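It seems that if there are ties, all of the tied values are included in the sum. I would like to count only the 3 instances with the highest values.
Any help would be much appreciated! :)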
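We can use distinct to remove the duplicate elements first. The way top_n works is that if values at the cutoff are tied, it keeps all of the tied rows, so it can return more than n rows per group.
my_data %>%
  distinct(city, number, .keep_all = TRUE) %>%
  group_by(city) %>%
  top_n(3, number) %>%
  summarise(top_nr = sum(number))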
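To make the tie behaviour concrete, here is a minimal base R sketch using Stockholm's values from the question. It assumes, as far as I understand it, that top_n ranks values the way min_rank does (tied values share the smallest rank) and keeps every row whose rank is at most n; the variable names are only for illustration.

```r
# Stockholm's values from the question
number <- c(15, 15, 30, 30, 10)

# Rank from largest to smallest; tied values share the smallest rank
# (assumed to mirror how top_n ranks rows)
rnk <- rank(-number, ties.method = "min")
rnk
# 3 3 1 1 5

# Keeping every value with rank <= 3 retains four values, not three
kept <- number[rnk <= 3]
sum(kept)
# 90

# Sorting and taking exactly three values cuts through the tie instead
top3 <- sum(head(sort(number, decreasing = TRUE), 3))
top3
# 75
```

The two 15s are tied for rank 3, so both survive the rank filter, which is where the 90 in the question comes from.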
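Update
Based on the OP's new expected output: the rows returned by top_n are not sorted, so arrange 'number' in descending order within each city and then sum only the first 3 values of 'number'.
my_data %>%
  group_by(city) %>%
  top_n(3, number) %>%
  arrange(city, desc(number)) %>%
  summarise(number = sum(head(number, 3)))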
# A tibble: 3 x 2
# city number
# <chr> <int>
#1 Lund 74
#2 Stockholm 75
#3 Uppsala 130
Data
my_data <- structure(list(city = c("Lund", "Lund", "Lund", "Lund", "Lund",
"Stockholm", "Stockholm", "Stockholm", "Stockholm", "Stockholm",
"Uppsala", "Uppsala", "Uppsala", "Uppsala", "Uppsala"), month = c("jan",
"feb", "mar", "apr", "may", "jan", "feb", "mar", "apr", "may",
"jan", "feb", "mar", "apr", "may"), number = c(12L, 12L, 18L,
28L, 28L, 15L, 15L, 30L, 30L, 10L, 22L, 30L, 40L, 60L, 30L)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15"))
This tidyverse (in fact, dplyr) solution is almost the same as the one above, but it filters the data frame by row number instead of using top_n.
library(tidyverse)
my_data %>%
  group_by(city) %>%
  arrange(desc(number), .by_group = TRUE) %>%
  filter(row_number() %in% 1:3) %>%
  summarise(top_nr = sum(number))
## A tibble: 3 x 2
# city top_nr
# <chr> <int>
#1 Lund 74
#2 Stockholm 75
#3 Uppsala 130
Life might be simpler without top_n():
my_data %>%
  group_by(city) %>%
  summarize(
    top_nr = sum(tail(sort(number), 3))
  )
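Since dplyr 1.0.0, top_n() has been superseded by slice_max(), which has a with_ties argument. The sketch below is not from the answers above; it assumes dplyr 1.0.0 or later is installed and rebuilds the question's data inline so that it runs on its own.

```r
library(dplyr)  # slice_max() requires dplyr >= 1.0.0

# Same values as my_data in the question, rebuilt here for self-containment
my_data <- data.frame(
  city = rep(c("Lund", "Stockholm", "Uppsala"), each = 5),
  month = rep(c("jan", "feb", "mar", "apr", "may"), times = 3),
  number = c(12L, 12L, 18L, 28L, 28L,
             15L, 15L, 30L, 30L, 10L,
             22L, 30L, 40L, 60L, 30L)
)

result <- my_data %>%
  group_by(city) %>%
  # with_ties = FALSE returns exactly 3 rows per city, even if values tie
  slice_max(number, n = 3, with_ties = FALSE) %>%
  summarise(top_nr = sum(number))

result
# Lund 74, Stockholm 75, Uppsala 130
```

With with_ties = FALSE there is no need to sort and truncate afterwards, because each group is already limited to three rows.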