Create a new column in a data frame: index within group (not unique between groups)

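I have a data frame with two columns: the first contains the group each individual belongs to, and the second contains the individual's ID. See below: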

df <- data.frame( group=c('G1','G1','G1','G1','G2','G2','G2','G2'), 
      indiv=c('indiv1','indiv1','indiv2','indiv2','indiv3',
              'indiv3','indiv4','indiv4'))

   group   indiv
1     G1  indiv1
2     G1  indiv1
3     G1  indiv2
4     G1  indiv2
5     G2  indiv3
6     G2  indiv3
7     G2  indiv4
8     G2  indiv4

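I would like to create a new column in my data frame (keeping the long format) that contains the index of each individual within its group, i.e.: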

   group   indiv  Ineed
1     G1  indiv1      1
2     G1  indiv1      1
3     G1  indiv2      2
4     G1  indiv2      2
5     G2  indiv3      1
6     G2  indiv3      1
7     G2  indiv4      2
8     G2  indiv4      2

I have tried using the data.table .N and .GRP symbols, but without success (by the way, data.table does a great job!).

Any help is much appreciated!

You can use the new rleid function here (from the development version, v >= 1.9.5)

library(data.table)
setDT(df)[, Ineed := rleid(indiv), group][]
#    group  indiv Ineed
# 1:    G1 indiv1     1
# 2:    G1 indiv1     1
# 3:    G1 indiv2     2
# 4:    G1 indiv2     2
# 5:    G2 indiv3     1
# 6:    G2 indiv3     1
# 7:    G2 indiv4     2
# 8:    G2 indiv4     2
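
One caveat worth illustrating: rleid numbers consecutive runs, so it matches the requested index only when each individual's rows are adjacent within the group, as they are in the question's data. The sketch below uses a small hypothetical table (not from the question) where that assumption does not hold, and compares rleid with an order-of-appearance index built from match and unique.

```r
library(data.table)

# Hypothetical data: indiv1's rows are NOT adjacent within G1
dt <- data.table(group = "G1", indiv = c("indiv1", "indiv2", "indiv1"))

# rleid starts a new id at every change of value, so the second
# run of indiv1 gets its own id
dt[, run_id := rleid(indiv), by = group]

# match(x, unique(x)) numbers values by first appearance instead,
# so both indiv1 rows share the same id
dt[, first_seen := match(indiv, unique(indiv)), by = group]

dt$run_id      # 1 2 3
dt$first_seen  # 1 2 1
```

If the rows can be reordered, sorting by group and indiv first should make rleid behave as in the answer above.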

Or you can convert to a factor (in order to create unique groups) and then convert back to numeric (if you are on the CRAN stable version, v <= 1.9.4)

setDT(df)[, Ineed := as.numeric(factor(indiv)), group][]
#    group  indiv Ineed
# 1:    G1 indiv1     1
# 2:    G1 indiv1     1
# 3:    G1 indiv2     2
# 4:    G1 indiv2     2
# 5:    G2 indiv3     1
# 6:    G2 indiv3     1
# 7:    G2 indiv4     2
# 8:    G2 indiv4     2
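
Note that factor codes follow the sorted order of the levels, not the order in which individuals appear. With the question's data the two coincide, but with IDs such as indiv10 and indiv2 they may not. A minimal base R sketch, using a hypothetical vector that is not part of the question:

```r
# Hypothetical IDs: "indiv10" sorts before "indiv2" as a string
indiv <- c("indiv2", "indiv2", "indiv10", "indiv10")

# Default levels are sorted, so indiv10 becomes level 1
by_level <- as.integer(factor(indiv))

# Passing levels = unique(indiv) keeps the order of first appearance
by_appearance <- as.integer(factor(indiv, levels = unique(indiv)))

by_level       # 2 2 1 1
by_appearance  # 1 1 2 2
```

Which numbering is wanted depends on the use case; the `levels = unique(...)` variant is just one way to keep appearance order.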

In 1.9.5 (the current development version), the functions frank (and frankv) are exported. With them, you can do:

require(data.table) ## 1.9.5+
setDT(df)[, col := frank(indiv, ties.method="dense"), by=group]
df
#    group  indiv col
# 1:    G1 indiv1   1
# 2:    G1 indiv1   1
# 3:    G1 indiv2   2
# 4:    G1 indiv2   2
# 5:    G2 indiv3   1
# 6:    G2 indiv3   1
# 7:    G2 indiv4   2
# 8:    G2 indiv4   2

You can install it by following the instructions here.

Another option, using base R:

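# as.integer(factor(...)) works whether indiv is a factor or a character
# column (the default since R 4.0.0); as.numeric() on character gives NA
df$Ineed <- with(df, ave(as.integer(factor(indiv)), group, 
                  FUN=function(x) cumsum(!duplicated(x))))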
df
#  group  indiv Ineed
#1    G1 indiv1     1
#2    G1 indiv1     1
#3    G1 indiv2     2
#4    G1 indiv2     2
#5    G2 indiv3     1
#6    G2 indiv3     1
#7    G2 indiv4     2
#8    G2 indiv4     2

The data.table version is

setDT(df)[, Ineed := cumsum(!duplicated(indiv)), group][]