R: calculate percentages without pivoting the data frame to wide and back to long
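I want to calculate, for each tier_1 group, new customers as a percentage of total customers. I can do this, but only by reshaping to a wide data frame, which I would rather avoid. Is there a way to do it without reshaping to wide?
Here is a copy-and-paste version of my first data frame. It is in long format. Each tier_1 channel is listed twice, once with new_to_file_string == "New Customer" and once with new_to_file_string == "Returning Customer".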
new_to_file_string  tier_1            sum
New Customer        Paid Display      1.053554
New Customer        Paid Search       17429.703628
New Customer        Paid Shopping     192.840719
New Customer        Paid Social       5589.378029
New Customer        Paid Video        301.723091
New Customer        Podcasts          22.268319
New Customer        Referring Domain  655.758022
New Customer        Unmapped Events   105.928832
Returning Customer  Affiliate         410.585386
Returning Customer  Audio             32.556144
Here is the dput() version:
structure(list(new_to_file_string = c("New Customer", "New Customer",
"New Customer", "New Customer", "New Customer", "New Customer",
"New Customer", "New Customer", "New Customer", "New Customer",
"New Customer", "New Customer", "New Customer", "New Customer",
"New Customer", "New Customer", "New Customer", "New Customer",
"Returning Customer", "Returning Customer", "Returning Customer",
"Returning Customer", "Returning Customer", "Returning Customer",
"Returning Customer", "Returning Customer", "Returning Customer",
"Returning Customer", "Returning Customer", "Returning Customer",
"Returning Customer", "Returning Customer", "Returning Customer",
"Returning Customer"), tier_1 = c("Affiliate", "Audio", "Customer Referral",
"Direct", "Display", "Email", "Organic Search", "Organic Social",
"Organic Video", "OTT", "Paid Display", "Paid Search", "Paid Shopping",
"Paid Social", "Paid Video", "Podcasts", "Referring Domain",
"Unmapped Events", "Affiliate", "Audio", "Direct", "Display",
"Email", "Organic Search", "Organic Social", "Organic Video",
"OTT", "Paid Search", "Paid Shopping", "Paid Social", "Paid Video",
"Podcasts", "Referring Domain", "Unmapped Events"), sum = c(971.6513387549,
20.9797788595, 4.0590886922, 3506, 80.2643802952, 1420.5576826329,
1556.3489737375, 349.5952195416, 367.403281163, 1364.4860623594,
1.0535537876, 17429.7036282718, 192.8407187215, 5589.378028519,
301.7230914497, 22.2683186546, 655.7580222743, 105.9288322873,
410.5853859286, 32.5561439337, 7327, 176.1993862616, 2222.2366388167,
899.3438590657, 58.5263954508, 47.8624061728, 463.8330675232,
4519.7009051073, 25.9581954589, 963.1761381512, 34.2195099128,
12.7666333106, 276.6026075478, 37.4327273523)), row.names = c(NA,
-34L), groups = structure(list(new_to_file_string = c("New Customer",
"Returning Customer"), .rows = structure(list(1:18, 19:34), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -2L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
Here is my attempt: I created a pivot_wider version, renamed the columns, and used mutate to create a new variable:
library(dplyr)
library(tidyr)
new_vs_returning2 <- pivot_wider(new_vs_returning, names_from = new_to_file_string, values_from = sum)
colnames(new_vs_returning2) <- c("Channel", "Returning_Customers", "New_Customers")
new_vs_returning2 <- new_vs_returning2 %>%
  mutate(Percent_New_Customers = New_Customers / (Returning_Customers + New_Customers)) %>%
  mutate(Percent_Returning_Customers = (1 - Percent_New_Customers))
Here is a copy-and-paste version of the new data frame:
Channel            Returning_Customers  New_Customers  Percent_New_Customers
Affiliate          971.651339           410.58539      0.2970442
Audio              20.979779            32.55614       0.6081177
Customer Referral  4.059089             NA             NA
Direct             3506.000000          7327.00000     0.6763593
Display            80.264380            176.19939      0.6870342
Email              1420.557683          2222.23664     0.6100363
Organic Search     1556.348974          899.34386      0.3662282
Organic Social     349.595220           58.52640       0.1434043
Organic Video      367.403281           47.86241       0.1152573
OTT                1364.486062          463.83307      0.2536937
Here is the dput() version of the new data frame:
structure(list(Channel = c("Affiliate", "Audio", "Customer Referral",
"Direct", "Display", "Email", "Organic Search", "Organic Social",
"Organic Video", "OTT", "Paid Display", "Paid Search", "Paid Shopping",
"Paid Social", "Paid Video", "Podcasts", "Referring Domain",
"Unmapped Events"), Returning_Customers = c(971.6513387549, 20.9797788595,
4.0590886922, 3506, 80.2643802952, 1420.5576826329, 1556.3489737375,
349.5952195416, 367.403281163, 1364.4860623594, 1.0535537876,
17429.7036282718, 192.8407187215, 5589.378028519, 301.7230914497,
22.2683186546, 655.7580222743, 105.9288322873), New_Customers = c(410.5853859286,
32.5561439337, NA, 7327, 176.1993862616, 2222.2366388167, 899.3438590657,
58.5263954508, 47.8624061728, 463.8330675232, NA, 4519.7009051073,
25.9581954589, 963.1761381512, 34.2195099128, 12.7666333106,
276.6026075478, 37.4327273523), Percent_New_Customers = c(0.297044188304731,
0.60811773170435, NA, 0.676359272593003, 0.687034229541257, 0.610036264120559,
0.366228156491009, 0.143404302298201, 0.115257310277352, 0.253693712406205,
NA, 0.205914511176559, 0.118639507678258, 0.146992472500331,
0.101861180374308, 0.364397054783492, 0.296669119974079, 0.261107143689027
)), row.names = c(NA, -18L), class = c("tbl_df", "tbl", "data.frame"
))
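I believe I calculated the percentages correctly, but I now have a wide data frame instead of a long one. I can get back to a long data frame like this: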
new_vs_returning2 <- new_vs_returning2 %>%
  dplyr::select(Channel, Percent_New_Customers, Percent_Returning_Customers)
new_vs_returning2 <- pivot_longer(new_vs_returning2, cols = 2:3, names_to = "Customer Type", values_to = "Percentage")
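But is there a more efficient or alternative way to calculate new customers as a percentage of total customers (new + returning), and returning customers as a percentage of the total, without reshaping the data frame to wide and then back to long?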
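In your solution (the wide format), the values for returning and new customers are swapped, so the percentages are incorrect. pivot_wider() creates the new columns in the order the values first appear, so the second column holds the "New Customer" values, but your colnames() call labels it Returning_Customers.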
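As a quick check of the swap, the Affiliate figures from the dput() output in the question can be plugged in by hand. This is only an arithmetic sketch; the variable names are made up for illustration.

```r
# Affiliate values copied from the dput() output in the question
new_affiliate       <- 971.6513387549  # row with new_to_file_string == "New Customer"
returning_affiliate <- 410.5853859286  # row with new_to_file_string == "Returning Customer"

# Share of new customers within the Affiliate channel, and its complement
share_new       <- new_affiliate / (new_affiliate + returning_affiliate)
share_returning <- 1 - share_new

round(share_new, 3)        # 0.703
round(share_returning, 3)  # 0.297
```

The 0.297 is the figure the wide version reports as Percent_New_Customers for Affiliate, which is really the returning share.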
I suggest this instead:
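library(dplyr)
# df is the long data frame from the question (new_vs_returning there)
df %>%
  group_by(tier_1) %>%
  # new-customer value divided by the channel total (new + returning)
  summarize(perc_new = sum[which(new_to_file_string == 'New Customer')] / sum(sum))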
This gives:
# A tibble: 18 x 2
tier_1 perc_new
<chr> <dbl>
1 Affiliate 0.703
2 Audio 0.392
3 Customer Referral 1
4 Direct 0.324
5 Display 0.313
6 Email 0.390
7 Organic Search 0.634
8 Organic Social 0.857
9 Organic Video 0.885
10 OTT 0.746
11 Paid Display 1
12 Paid Search 0.794
13 Paid Shopping 0.881
14 Paid Social 0.853
15 Paid Video 0.898
16 Podcasts 0.636
17 Referring Domain 0.703
18 Unmapped Events 0.739
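If you also want the returning-customer share and want to keep one row per customer type, one option is to use mutate() instead of summarize(), so every row keeps its place and gets its share of the channel total. The following is a minimal sketch, assuming a small stand-in data frame with values rounded from the dput() output; the name pct_long is hypothetical, and Percentage just mirrors the column name from your pivot_longer() call.

```r
library(dplyr)

# Small stand-in for the question's data (rounded subset, for illustration only)
new_vs_returning <- data.frame(
  new_to_file_string = c("New Customer", "New Customer", "New Customer",
                         "Returning Customer", "Returning Customer"),
  tier_1 = c("Affiliate", "Audio", "Customer Referral", "Affiliate", "Audio"),
  sum    = c(971.65, 20.98, 4.06, 410.59, 32.56)
)

pct_long <- new_vs_returning %>%
  ungroup() %>%                            # the dput() object is grouped by new_to_file_string
  group_by(tier_1) %>%                     # one group per channel
  mutate(Percentage = sum / sum(sum)) %>%  # each row's share of its channel total
  ungroup()

pct_long
```

The result stays in long format, with the two shares in each channel adding up to 1. A channel that appears under only one customer type, such as Customer Referral here, gets a share of 1 for that single row.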