Panel Data - sum by group and create new variable
I know there are already plenty of "sum by group" questions, but none of them solved my problem. Here it is:
df1 is my simplified dataset:
> df1 = data.table( Year = c(2009,2009,2009,2009,2009,2009,2009,2009,2010,2010,2010,2010),
ID = c(1621, 1621, 1628,1628,3101, 3101,3105,3105,1621, 1621, 1628,1628 ),
category= c("0910","0910","0911","0913", "0914", "0910","0910","0911","1014","1012","1011","1013"),
var1 = c(60,70, 400,300,15,20, 200,150,61,71,401,301) )
df2 is the desired result (see var2):
> df2 = data.table( Year = c(2009,2009,2009,2009,2009,2009,2009,2009,2010,2010,2010,2010),
ID = c(1621, 1621, 1628,1628,3101, 3101,3105,3105,1621, 1621, 1628,1628 ),
category= c("0910","0910","0911","0913", "0914", "0910","0910","0911","1014","1012","1011","1013"),
var1 = c(60,70, 400,300,15,20, 200,150,61,71,401,301),
var2= c(130,130,700,700,35,35,350,350,132,132,702,702) )
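So I want to compute the sum of var1 grouped by ID and by the first two digits of category, and store it in var2. That is, if the first two digits of category are 09 (or 10, and so on), the sum of var1 over that ID and that two-digit prefix should be assigned to var2. Rows with the same ID and the same category prefix should then all receive the same sum.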
I tried to achieve this with:
> df1$var2 = rep(NA, rep(length(df1$ID)))
df1$var2 = ifelse(substr(df1$category,1,2)=="09", by(df1[Year==2009,]$var1, df1[Year==2009,]$ID,sum), df1$var2)
df1$Var2 = ifelse(substr(df1$category,1,2)=="10", by(df1[Year==2010,]$var1, df1[Year==2010,]$ID,sum), df1$var1)
But here the sums are not assigned to the correct rows.
Can anybody help me?
df1 = data.frame( Year = c(2009,2009,2009,2009,2009,2009,2009,2009,2010,2010,2010,2010),
ID = c(1621, 1621, 1628,1628,3101, 3101,3105,3105,1621, 1621, 1628,1628 ),
category= c("0910",NA,"0911","0913", "0914", "0910","0910",NA,"1014","1012",NA,"1013"),
var1 = c(60,70, 400,300,15,20, 200,150,61,71,401,301) )
I added NA values to the OP's original data frame to reflect the full specification he wants.
df1$category_sub = substr(df1$category, 1, 2)
df1_aggre = aggregate(var1 ~ ID + category_sub, data = df1, sum)
names(df1_aggre)[3] = "var2"
df2 = merge(df1, df1_aggre, all=TRUE)
df2[order(df2$Year),]
Result:
> df2[order(df2$Year),]
     ID category_sub Year category var1 var2
1  1621           09 2009     0910   60   60
4  1621         <NA> 2009     <NA>   70   NA
5  1628           09 2009     0911  400  700
6  1628           09 2009     0913  300  700
9  3101           09 2009     0914   15   35
10 3101           09 2009     0910   20   35
11 3105           09 2009     0910  200  200
12 3105         <NA> 2009     <NA>  150   NA
2  1621           10 2010     1014   61  132
3  1621           10 2010     1012   71  132
7  1628           10 2010     1013  301  301
8  1628         <NA> 2010     <NA>  401   NA
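I first extracted the first two digits of category into category_sub, then summed var1 grouped by ID and category_sub. Next I renamed the aggregated var1 column to var2 and merged df1 and df1_aggre on ID and category_sub with the all=TRUE option, which specifies a full outer join. Rows whose category is NA are dropped by aggregate, so they end up with var2 = NA after the merge. The resulting data frame is not ordered by Year, so I ordered df2 by Year to get the desired result.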
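As a complement to the aggregate/merge steps above, here is a minimal sketch of the same grouping done in one step with base R's ave(), which returns a group-wise result of the same length as the input and so keeps the original row order. It uses the question's data without NA values and assumes category has no missing entries; the helper name prefix is only illustrative and does not appear in the original post.

```r
# Question's data, rebuilt as a plain data.frame (no NA in category assumed)
df1 <- data.frame(
  Year     = c(rep(2009, 8), rep(2010, 4)),
  ID       = c(1621, 1621, 1628, 1628, 3101, 3101, 3105, 3105,
               1621, 1621, 1628, 1628),
  category = c("0910", "0910", "0911", "0913", "0914", "0910",
               "0910", "0911", "1014", "1012", "1011", "1013"),
  var1     = c(60, 70, 400, 300, 15, 20, 200, 150, 61, 71, 401, 301),
  stringsAsFactors = FALSE
)

# First two characters of category (hypothetical helper vector)
prefix <- substr(df1$category, 1, 2)

# Sum var1 within each (ID, prefix) group; ave() repeats the group sum
# on every row of the group, so no merge or re-ordering is needed
df1$var2 <- ave(df1$var1, df1$ID, prefix, FUN = sum)

print(df1)
```

Since the question builds df1 as a data.table, the same grouping could presumably also be written there with := and by, for example df1[, var2 := sum(var1), by = .(ID, substr(category, 1, 2))], though that line is not tested here.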