How many categories are there in a column for each group in R?
I have a data frame (df) containing the directors (DirectorID) of 2 different companies (CompanyID) over two years (2006 and 2007), along with each director's gender (male or female).
df <-
CompanyID Name Country ISIN Director_2006 Gender_2006 Director_2007 Gender_2007
25830 BANKxxx Austria AT000504 11734844255 M 11734844255 M
25830 BANKxxx Austria AT000504 187836811559 F 5524344997 F
25830 BANKxxx Austria AT000504 5524344997 F 5524354997 M
25830 BANKxxx Austria AT000504 5524354997 M 5742347684 M
25830 BANKxxx Austria AT000504 6613115791 M 40160443378 M
12339 BANKyyy Belgium AT034003 5524344997 M 5524344997 M
12339 BANKyyy Belgium AT034003 5524354997 M 5524354997 M
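I would like to add 5 more columns after each gender column, i.e. after "Gender_2006" and "Gender_2007", with the following information:
- Column 1: the number of females in that company in that year
- Column 2: the number of males in that company in that year
- Column 3: 1 if that company had at least one female in that year, 0 otherwise
- Column 4: the percentage of females (F) in that company in that year
- Column 5: the Blau index
df_final is my expected final output.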
df_final <-
CompanyID Name Country ISIN Director_2006 Gender_2006 F2006 M2006 Findex2006 Fperce2006 Blauindex2006 Director_2007 Gender_2007 F2007 M2007 Findex2007 Fperce2007 Blauindex2007
25830 BANKxxx Austria AT000504 11734844255 M 2 3 1 0.4 0.25 11734844255 M 1 4 1 0.25 0.07
25830 BANKxxx Austria AT000504 187836811559 F NA NA NA NA NA 5524344997 F NA NA NA NA NA
25830 BANKxxx Austria AT000504 5524344997 F NA NA NA NA NA 5524354997 M NA NA NA NA NA
25830 BANKxxx Austria AT000504 5524354997 M NA NA NA NA NA 5742347684 M NA NA NA NA NA
25830 BANKxxx Austria AT000504 6613115791 M NA NA NA NA NA 40160443378 M NA NA NA NA NA
12339 BANKyyy Belgium AT034003 5524344997 M 0 2 0 0 0 5524344997 M 0 2 0 0 0
12339 BANKyyy Belgium AT034003 5524354997 M NA NA NA NA NA 5524354997 M NA NA NA NA NA
Could someone please show me how to do this? Thanks.
My data:
df <- read.table(text =
"CompanyID Name Country ISIN Director_2006 Gender_2006 Director_2007 Gender_2007
25830 BANKxxx Austria AT000504 11734844255 M 11734844255 M
25830 BANKxxx Austria AT000504 187836811559 F 5524344997 F
25830 BANKxxx Austria AT000504 5524344997 F 5524354997 M
25830 BANKxxx Austria AT000504 5524354997 M 5742347684 M
25830 BANKxxx Austria AT000504 6613115791 M 40160443378 M
12339 BANKyyy Belgium AT034003 5524344997 M 5524344997 M
12339 BANKyyy Belgium AT034003 5524354997 M 5524354997 M",
header = T, stringsAsFactors = F)
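Something like the following with dplyr. The group_by clause specifies what to group by, in this case CompanyID. mutate creates the new columns according to the conditions you specify. select only changes the column order.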
library(dplyr)
df %>%
  group_by(CompanyID) %>%
  mutate(F2006 = sum(Gender_2006 == "F", na.rm = TRUE),
         M2006 = sum(Gender_2006 == "M", na.rm = TRUE),
         Findex2006 = as.integer(sum(Gender_2006 == "F", na.rm = TRUE) > 0),
         Fperce2006 = F2006 / (F2006 + M2006),
         F2007 = sum(Gender_2007 == "F", na.rm = TRUE),
         M2007 = sum(Gender_2007 == "M", na.rm = TRUE),
         Findex2007 = as.integer(sum(Gender_2007 == "F", na.rm = TRUE) > 0),
         Fperce2007 = F2007 / (F2007 + M2007)) %>%
  select(-matches("2006|2007"), matches("2006"), matches("2007"))
# A tibble: 8 x 16
# Groups: CompanyID [2]
# CompanyID Name Country ISIN Director_2006 Gender_2006 F2006 M2006 Findex2006 Fperce2006 Director_2007 Gender_2007
# <int> <fct> <fct> <fct> <dbl> <fct> <int> <int> <int> <dbl> <dbl> <fct>
# 1 25830 BANKxxx Austria AT000504 11734844255 M 2 3 1 0.400 11734844255 M
# 2 25830 BANKxxx Austria AT000504 187836811559 F 2 3 1 0.400 5524344997 F
# 3 25830 BANKxxx Austria AT000504 5524344997 F 2 3 1 0.400 5524354997 M
# 4 25830 BANKxxx Austria AT000504 5524354997 M 2 3 1 0.400 5742347684 M
# 5 25830 BANKxxx Austria AT000504 6613115791 M 2 3 1 0.400 40160443378 M
# 6 12339 BANKyyy Belgium AT034003 5524344997 M 0 2 0 0 5524344997 M
# 7 12339 BANKyyy Belgium AT034003 5524354997 M 0 2 0 0 5524354997 M
# 8 12339 BANKyyy Belgium AT034003 NA <NA> 0 2 0 0 NA <NA>
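The pipeline above does not compute the Blau index column. A minimal sketch of how it could be added, assuming the usual definition 1 - sum(p_i^2), where p_i is the share of each gender within the company and year. The helper name `blau_index` and the small `toy` data frame are hypothetical and only there for illustration. Note that under this definition a board with 2 F and 3 M gives 0.48, which differs from the 0.25 shown in the question's df_final, so the asker may have a different formula in mind.

```r
library(dplyr)

# Hypothetical helper: Blau index, 1 - sum of squared category shares.
# Assumes the standard definition; NA values are ignored.
blau_index <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) return(NA_real_)
  p <- as.numeric(table(x)) / length(x)
  1 - sum(p^2)
}

# Toy data mirroring the gender columns of the question.
toy <- data.frame(
  CompanyID   = c(25830, 25830, 25830, 25830, 25830, 12339, 12339),
  Gender_2006 = c("M", "F", "F", "M", "M", "M", "M"),
  Gender_2007 = c("M", "F", "M", "M", "M", "M", "M"),
  stringsAsFactors = FALSE
)

res_blau <- toy %>%
  group_by(CompanyID) %>%
  mutate(Blauindex2006 = blau_index(Gender_2006),
         Blauindex2007 = blau_index(Gender_2007)) %>%
  ungroup()
```

The helper returns a single value per group, which mutate recycles to every row of that company, in the same way as the counts above.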
If you need NA in every row except the first row of each group, you can change the expressions inside mutate to something like:
F2006 = ifelse(row_number() == 1, sum(Gender_2006 == "F", na.rm = TRUE), NA),
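As a rough sketch of the same idea applied to several columns at once: compute the group-level values first, then blank out everything but the first row of each group in a second mutate step. The helper `first_only` and the `toy` data frame are hypothetical names introduced here for illustration.

```r
library(dplyr)

# Toy data mirroring one gender column of the question.
toy <- data.frame(
  CompanyID   = c(25830, 25830, 25830, 25830, 25830, 12339, 12339),
  Gender_2006 = c("M", "F", "F", "M", "M", "M", "M"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the first element, set the rest to NA.
first_only <- function(x) replace(x, seq_along(x) > 1, NA)

res_first <- toy %>%
  group_by(CompanyID) %>%
  # Step 1: group-level summaries, repeated on every row of the group.
  mutate(F2006 = sum(Gender_2006 == "F", na.rm = TRUE),
         M2006 = sum(Gender_2006 == "M", na.rm = TRUE),
         Fperce2006 = F2006 / (F2006 + M2006)) %>%
  # Step 2: keep each value only on the first row of each company.
  mutate(F2006 = first_only(F2006),
         M2006 = first_only(M2006),
         Fperce2006 = first_only(Fperce2006)) %>%
  ungroup()
```

Doing the blanking in a separate step keeps Fperce2006 computable, since it depends on F2006 and M2006 still being present on every row when it is calculated.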