Percentage of each column in R
I have a data frame with 10 columns, like this:
$ group (chr) "a", "b", "c", "d", "e"...
$ 1 (int) 31, 3965, 3381, 24745, 267, 20120, 795, 263,
$ 2 (int) 30, 4165, 412, 5064, 168, 8221, 259, 159, 6,
$ 3 (int) 13, 2308, 82, 1298, 37, 4314, 40, 110, 3,
$ 4 (int) 10, 1673, 10, 369, 12, 3178, 28, 53, 1, 2844,
...
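Column 10 is the total of each row, and I would like to create a new DF that shows how each column is distributed relative to that total.
i.e.
(source: imgsafe.org)
How can I do this?
My data input: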
structure(list(`1` = c(31L, 3965L, 3381L, 24745L, 267L, 20120L,
795L, 263L, 17L, 11920L, 3115L, 1079L, 1273L, 217L, 4298L, 24410L,
1L, 1L, 11008L, 5849L), `2` = c(30L, 4165L, 412L, 5064L, 168L,
8221L, 259L, 159L, 6L, 6948L, 2112L, 617L, 489L, 149L, 2869L,
13270L, 1L, NA, 9204L, 4547L), `3` = c(13L, 2308L, 82L, 1298L,
37L, 4314L, 40L, 110L, 3L, 3514L, 1046L, 391L, 597L, 66L, 2295L,
7785L, NA, NA, 5555L, 2425L), `4` = c(10L, 1673L, 10L, 369L,
12L, 3178L, 28L, 53L, 1L, 2844L, 577L, 359L, 321L, 43L, 1444L,
5313L, NA, NA, 4574L, 1833L), `5` = c(3L, 1068L, 1L, 139L, 4L,
2081L, 5L, 32L, NA, 2033L, 360L, 307L, 150L, 40L, 784L, 3376L,
NA, NA, 2855L, 1359L), `6` = c(11L, 759L, NA, 66L, 3L, 1507L,
3L, 20L, NA, 1610L, 197L, 190L, 114L, 29L, 591L, 2472L, NA, NA,
2165L, 1048L), `7` = c(7L, 518L, NA, 17L, 2L, 1109L, NA, 12L,
NA, 1076L, 142L, 120L, 72L, 24L, 445L, 1697L, 1L, NA, 1580L,
921L), `8` = c(5L, 389L, 2L, 14L, NA, 833L, NA, 13L, NA, 831L,
87L, 65L, 46L, 21L, 373L, 1279L, NA, NA, 1205L, 789L), `9` = c(6L,
299L, NA, 5L, NA, 646L, NA, 8L, NA, 588L, 73L, 36L, 28L, 9L,
261L, 933L, NA, NA, 929L, 601L), group = c("ACU", "ALE", "ANA",
"ANE", "BAN", "CAR", "CAR", "CIR", "CIR", "CIR", "CIR", "CIR",
"CIR", "CIR", "CIR", "DER", "DIA", "DIE", "DIG", "END"), tot = c(116,
15144, NA, 31717, NA, 42009, NA, 670, NA, 31364, 7709, 3164,
3090, 598, 13360, 60535, NA, NA, 39075, 19372)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -20L), .Names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "group", "tot"))
It looks like you want something like this (I named the data.frame df):
# lapply divides each of columns 1 to 9 (and tot itself) by the tot column
# to get the distributions; tot becomes 1 (100%) wherever it is not NA.
# df[-10] drops column 10 (group), which is added back as the first column.
data.frame(group = df$group,
           lapply(df[-10], function(x) x / df$tot))
Output:
   group        X1        X2         X3         X4          X5          X6           X7           X8           X9 tot
1    ACU 0.2672414 0.2586207 0.11206897 0.08620690 0.025862069 0.094827586 0.0603448276 0.0431034483 0.0517241379   1
2    ALE 0.2618199 0.2750264 0.15240359 0.11047279 0.070522979 0.050118859 0.0342049657 0.0256867406 0.0197437929   1
3    ANA        NA        NA         NA         NA          NA          NA           NA           NA           NA  NA
4    ANE 0.7801810 0.1596620 0.04092443 0.01163414 0.004382508 0.002080903 0.0005359902 0.0004414037 0.0001576442   1
5    BAN        NA        NA         NA         NA          NA          NA           NA           NA           NA  NA
6    CAR 0.4789450 0.1956962 0.10269228 0.07565046 0.049537004 0.035873265 0.0263991050 0.0198290842 0.0153776572   1
7    CAR        NA        NA         NA         NA          NA          NA           NA           NA           NA  NA
8    CIR 0.3925373 0.2373134 0.16417910 0.07910448 0.047761194 0.029850746 0.0179104478 0.0194029851 0.0119402985   1
9    CIR        NA        NA         NA         NA          NA          NA           NA           NA           NA  NA
10   CIR 0.3800536 0.2215279 0.11203928 0.09067721 0.064819538 0.051332738 0.0343068486 0.0264953450 0.0187476087   1
and so on....
Note: when the tot column is NA, you get a row of NAs, because dividing by NA always yields NA.
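In the sample data, tot appears to be NA in exactly those rows where at least one count is missing (ANA, for example). If those all-NA rows are unwanted, one option is to recompute the total from the count columns. The following is only a minimal sketch: it assumes a missing count should be treated as zero, which the question does not state, and it uses a small hypothetical data frame `toy` (with made-up names `counts`, `row_total` and `shares`) rather than the full df.

```r
# Hypothetical toy data shaped like the question's df (two count columns only)
toy <- data.frame(`1` = c(31L, 3381L, 1L),
                  `2` = c(30L, 412L, NA),
                  group = c("ACU", "ANA", "DIE"),
                  tot = c(61, NA, NA),
                  check.names = FALSE)

# Keep only the count columns
counts <- toy[setdiff(names(toy), c("group", "tot"))]

# Assumption: an NA count means zero observations, so it is skipped in the sum
row_total <- rowSums(counts, na.rm = TRUE)

# Each column is divided element-wise by the recomputed row total;
# cells whose count is NA stay NA, but the rest of the row is now filled in
shares <- data.frame(group = toy$group, counts / row_total)
```

As in the answer above, data.frame() renames the columns `1` and `2` to X1 and X2.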
A simple way to do this is to select the columns you want as percentages and then divide them by the total.
# Create an example data frame
x <- rnorm(10, 5, 2)
y <- rnorm(10, 6, 2)
df <- data.frame(group = letters[1:10], x, y, tot = x + y)
# Calculate the proportions, then copy the group names to the new data frame
newdf <- df[, 2:4] / df$tot
newdf$group <- df$group
A more elegant way is to solve this with dplyr.
library(dplyr)
newdf <- df %>% group_by(group) %>% mutate_each(funs(./tot))
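In newer dplyr releases, mutate_each() and funs() are deprecated. A sketch of the same calculation with across() might look like the following, assuming dplyr 1.0.0 or later and the example columns group, x, y and tot from the previous answer. The set.seed() call is added here only for reproducibility. Unlike the mutate_each() version, this sketch leaves tot unchanged rather than turning it into 1.

```r
library(dplyr)

set.seed(1)  # only to make the example reproducible
x <- rnorm(10, 5, 2)
y <- rnorm(10, 6, 2)
df <- data.frame(group = letters[1:10], x, y, tot = x + y)

# Divide x and y by the row total; group and tot are left as they are,
# so no group_by() is needed to protect the group column
newdf <- df %>% mutate(across(c(x, y), ~ .x / tot))
```

To express the result as percentages instead of proportions, the lambda could be changed to ~ 100 * .x / tot.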