Summarise tibble to multiple rows of output
Suppose I have the following tibble in R:
activation_date | country | campaign | revenue | users
======================================================
1 | 1 | 1 | R_1 | U_1
2 | 1 | 1 | R_2 | U_2
3 | 1 | 1 | R_3 | U_3
1 | 1 | 2 | R_4 | U_4
2 | 1 | 2 | R_5 | U_5
3 | 1 | 2 | R_6 | U_6
1 | 2 | 3 | R_7 | U_7
2 | 2 | 3 | R_8 | U_8
3 | 2 | 3 | R_9 | U_9
I would like to group this tibble by country and summarise its data so that the output is this tibble:
country | campaign | ltv
==========================
1 | 1 | ltv_1
1 | 2 | ltv_2
2 | 3 | ltv_3
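However, I want ltv_1 and ltv_2 to both be computed jointly from R_1 to R_6 and U_1 to U_6, and ltv_3 to be computed from R_7 to R_9 and U_7 to U_9.
I cannot group_by "country" and summarise, because that drops the "campaign" column that I want to keep. But I cannot group_by "country" and "campaign" either, because then I could not use the first three rows to help compute ltv_2, nor the next three rows to help compute ltv_1.
One possible approach is to group by "country" and use the group_modify function to produce the output tibble for each group. However, that function is in the "experimental" stage, so I would rather not rely on it too heavily. Is there a different, established way to do this?
An example input tibble is: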
# A tibble: 9 x 5
activation_date country campaign revenue users
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 11
2 2 1 1 2 12
3 3 1 1 3 13
4 1 1 2 4 14
5 2 1 2 5 15
6 3 1 2 6 16
7 1 2 3 7 17
8 2 2 3 8 18
9 3 2 3 9 19
Its output would be:
# A tibble: 3 x 3
country campaign ltv
<dbl> <dbl> <dbl>
1 1 1 0.213
2 1 2 0.296
3 2 3 0.444
The code that generates it using the group_modify function is:
library(dplyr)

test_tibble = tribble(
  ~activation_date, ~country, ~campaign, ~revenue, ~users,
  1, 1, 1, 1, 11,
  2, 1, 1, 2, 12,
  3, 1, 1, 3, 13,
  1, 1, 2, 4, 14,
  2, 1, 2, 5, 15,
  3, 1, 2, 6, 16,
  1, 2, 3, 7, 17,
  2, 2, 3, 8, 18,
  3, 2, 3, 9, 19)

# Returns one row per campaign: the average of the campaign-level ratio and
# the ratio over all rows passed in (here, all rows of one country).
test_function = function(activation_date, campaign, revenue, users) {
  total_ltv = sum(revenue) / sum(users)
  campaign_ltv = double(0)
  campaign_names = unique(campaign)
  # loop variable is named cmp so that it does not mask the c() function
  for (cmp in campaign_names) {
    campaign_ltv = c(campaign_ltv, sum(revenue[campaign == cmp]) / sum(users[campaign == cmp]))
  }
  return(tibble(campaign = campaign_names,
                ltv = campaign_ltv / 2 + total_ltv / 2))
}

test_tibble %>%
  group_by(country) %>%
  group_modify(~ test_function(.x$activation_date, .x$campaign, .x$revenue, .x$users)) %>%
  ungroup()
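To make the target numbers concrete, the arithmetic behind the first output row can be checked by hand. This is only a sketch in base R that copies the example values for country 1; the variable names are made up for illustration and are not part of the question's code.

```r
# Country 1, campaign 1 only (activation_date 1 to 3)
campaign_revenue <- c(1, 2, 3)
campaign_users   <- c(11, 12, 13)

# Country 1, campaigns 1 and 2 pooled together
country_revenue <- c(1, 2, 3, 4, 5, 6)
country_users   <- c(11, 12, 13, 14, 15, 16)

campaign_ltv <- sum(campaign_revenue) / sum(campaign_users)  # 6 / 36
total_ltv    <- sum(country_revenue) / sum(country_users)    # 21 / 81

# Same blend as in test_function: half campaign-level, half country-level
ltv_1 <- campaign_ltv / 2 + total_ltv / 2
print(round(ltv_1, 3))
```

The rounded value agrees with the 0.213 shown in the expected output for country 1, campaign 1.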
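Option 1 -
A somewhat verbose but transparent way to do this is with joins: compute the per-campaign ratio and the per-country ratio separately, then join them on country. Compared with the code in test_function, it is not that verbose either.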
test_tibble %>%
  group_by(country, campaign) %>%
  summarize(campaign_ltv = sum(revenue)/sum(users)) %>%
  inner_join(
    test_tibble %>%
      group_by(country) %>%
      summarise(total_ltv = sum(revenue)/sum(users)),
    by = "country"
  ) %>%
  mutate(ltv = (total_ltv + campaign_ltv)/2) %>%
  ungroup()
# A tibble: 3 x 5
country campaign campaign_ltv total_ltv ltv
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 0.167 0.259 0.213
2 1 2 0.333 0.259 0.296
3 2 3 0.444 0.444 0.444
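The same two-level calculation can probably also be written without a join, by attaching the country-level ratio with a grouped mutate before summarising per campaign. The following is a sketch of that variant rather than part of the original answer; the name `result` is only illustrative.

```r
library(dplyr)
library(tibble)

test_tibble <- tribble(
  ~activation_date, ~country, ~campaign, ~revenue, ~users,
  1, 1, 1, 1, 11,
  2, 1, 1, 2, 12,
  3, 1, 1, 3, 13,
  1, 1, 2, 4, 14,
  2, 1, 2, 5, 15,
  3, 1, 2, 6, 16,
  1, 2, 3, 7, 17,
  2, 2, 3, 8, 18,
  3, 2, 3, 9, 19
)

result <- test_tibble %>%
  group_by(country) %>%
  # country-level ratio, repeated on every row of that country
  mutate(total_ltv = sum(revenue) / sum(users)) %>%
  # regroup at the finer level; total_ltv is constant within each group
  group_by(country, campaign) %>%
  summarise(ltv = (sum(revenue) / sum(users) + first(total_ltv)) / 2) %>%
  ungroup()

print(result)
```

Because total_ltv is computed before regrouping, each campaign still sees the pooled revenue and users of its whole country.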
Option 2) -
Wrap the test_function output in a list to get a nested tibble, then use unnest.
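test_tibble %>%
  group_by(country) %>%
  mutate(
    ltv = list(test_function(activation_date, campaign, revenue, users))
  ) %>%
  select(country, ltv) %>%
  # the nested tibble is repeated on every row; keep one copy per country
  filter(row_number() == 1) %>%
  # tidyr >= 1.0.0 expects the list-column to be named via cols
  tidyr::unnest(cols = ltv) %>%
  ungroup()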
# A tibble: 3 x 3
country campaign ltv
<dbl> <dbl> <dbl>
1 1 1 0.213
2 1 2 0.296
3 2 3 0.444
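The split-apply-combine idea behind Option 2 can also be sketched without a list-column, using split() and lapply() from base R and dplyr::bind_rows() to stack the pieces. This is an illustrative variant, not part of the original answer: test_function is restated compactly with vapply, and `by_country`, `pieces` and `result` are made-up names.

```r
library(dplyr)
library(tibble)

test_tibble <- tribble(
  ~activation_date, ~country, ~campaign, ~revenue, ~users,
  1, 1, 1, 1, 11,
  2, 1, 1, 2, 12,
  3, 1, 1, 3, 13,
  1, 1, 2, 4, 14,
  2, 1, 2, 5, 15,
  3, 1, 2, 6, 16,
  1, 2, 3, 7, 17,
  2, 2, 3, 8, 18,
  3, 2, 3, 9, 19
)

# Compact restatement of test_function from the question
test_function <- function(activation_date, campaign, revenue, users) {
  total_ltv <- sum(revenue) / sum(users)
  campaign_names <- unique(campaign)
  campaign_ltv <- vapply(
    campaign_names,
    function(nm) sum(revenue[campaign == nm]) / sum(users[campaign == nm]),
    numeric(1)
  )
  tibble(campaign = campaign_names, ltv = campaign_ltv / 2 + total_ltv / 2)
}

# Split the rows by country, apply the function to each piece, then stack
by_country <- split(test_tibble, test_tibble$country)
pieces <- lapply(by_country, function(d) {
  out <- test_function(d$activation_date, d$campaign, d$revenue, d$users)
  out$country <- d$country[1]  # re-attach the group key
  out[, c("country", "campaign", "ltv")]
})
result <- bind_rows(pieces)

print(result)
```

Each piece may contribute any number of rows, which is what makes this pattern suitable when a summary is not one row per group.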
Option 3) -
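df %>%
  group_by(country) %>%
  # country is already the grouping variable, so complete only on campaign;
  # newer tidyr versions do not allow completing on a grouping column
  tidyr::complete(campaign, tidyr::nesting(revenue, users)) %>%
  # .add is the dplyr >= 1.0.0 spelling; older versions use add = TRUE
  group_by(campaign, .add = TRUE)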
# now you have all revenue and users for each country-campaign
# for total_ltv: use revenue and users as is
# for campaign_ltv: use revenue and users where activation_date is not NA
# A tibble: 15 x 5
# Groups: country, campaign [3]
country campaign revenue users activation_date
<int> <int> <chr> <chr> <int>
1 1 1 R_1 U_1 1
2 1 1 R_2 U_2 2
3 1 1 R_3 U_3 3
4 1 1 R_4 U_4 NA
5 1 1 R_5 U_5 NA
6 1 1 R_6 U_6 NA
7 1 2 R_1 U_1 NA
8 1 2 R_2 U_2 NA
9 1 2 R_3 U_3 NA
10 1 2 R_4 U_4 1
11 1 2 R_5 U_5 2
12 1 2 R_6 U_6 3
13 2 3 R_7 U_7 1
14 2 3 R_8 U_8 2
15 2 3 R_9 U_9 3
Demo with test_tibble -
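test_tibble %>%
  group_by(country) %>%
  # complete only on campaign, since country is the grouping variable
  tidyr::complete(campaign, tidyr::nesting(revenue, users)) %>%
  # .add is the dplyr >= 1.0.0 spelling; older versions use add = TRUE
  group_by(campaign, .add = TRUE) %>%
  summarise(
    # first term: country-level ratio; second term: campaign-level ratio
    ltv = sum(revenue)/sum(users)/2 +
      sum(revenue[!is.na(activation_date)])/sum(users[!is.na(activation_date)])/2
  ) %>%
  ungroup()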
# A tibble: 3 x 3
country campaign ltv
<dbl> <dbl> <dbl>
1 1 1 0.213
2 1 2 0.296
3 2 3 0.444