Running iterations in R to create a new variable with specific conditions
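I have the data below, and I want to create a variable that marks the group with the highest status in each country in each year. Each group can have one of the following statuses:
* 1 = monopoly,
* 2 = dominant,
* 3 = senior,
* 4 = junior, or
* 5 = discriminated.
A group with status 1 or 2 automatically has the highest status, because only one group per country can hold that status in any given year. However, some countries have several groups with status 3 (and sometimes 3 is the highest status any group in that country reaches that year). In that case, I want the largest group to be coded as having the highest status. How can I do this?
Data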
D1 <- data.frame(row = c(1, 2, 3, 4, 5, 6, 7 , 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20),
country = c("US", "US", "US", "US", "US", "US", "US", "US","US", "US", "Canada", "Canada", "Canada", "Canada", "Canada", "Canada", "Canada", "Canada", "Canada", "Canada"),
year = c(1991, 1992, 1993, 1994, 1995, 1991, 1992, 1993, 1994, 1995, 1991, 1992, 1993, 1994, 1995, 1991, 1992, 1993, 1994, 1995),
group = c("White", "White", "White", "White", "White", "Latino", "Latino", "Latino", "Latino", "Latino","English", "English", "English", "English", "English", "French", "French", "French", "French", "French"),
groupstatus = c("1", "1", "1", "3", "3", "5", "5","5", "3", "3", "2", "2", "2", "3", "3", "3", "3", "3", "3", "4"),
groupsize= c(0.7, 0.7, 0.7, 0.7, 0.7, 0.15, 0.15, 0.15, 0.15, 0.15, 0.1, 0.1, 0.1, 0.1, 0.1, 0.2, 0.2, 0.2, 0.2, 0.2))
Expected output
D1 <- data.frame(row = c(1, 2, 3, 4, 5, 6, 7 , 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20),
country = c("US", "US", "US", "US", "US", "US", "US", "US","US", "US", "Canada", "Canada", "Canada", "Canada", "Canada", "Canada", "Canada", "Canada", "Canada", "Canada"),
year = c(1991, 1992, 1993, 1994, 1995, 1991, 1992, 1993, 1994, 1995, 1991, 1992, 1993, 1994, 1995, 1991, 1992, 1993, 1994, 1995),
group = c("White", "White", "White", "White", "White", "Latino", "Latino", "Latino", "Latino", "Latino","English", "English", "English", "English", "English", "French", "French", "French", "French", "French"),
groupstatus = c("1", "1", "1", "3", "3", "5", "5","5", "3", "3", "2", "2", "2", "3", "3", "3", "3", "3", "3", "4"),
groupsize= c(0.7, 0.7, 0.7, 0.7, 0.7, 0.15, 0.15, 0.15, 0.15, 0.15, 0.1, 0.1, 0.1, 0.1, 0.1, 0.2, 0.2, 0.2, 0.2, 0.2),
highest= c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0))
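Here is one way with data.table.
We convert the 'data.frame' to a 'data.table' (setDT(D1)). Grouped by 'country' and 'year', we create a binary column 'highest' that is 1 where 'groupstatus' is 1 or 2 (this could all be done in a single step, but I split it up to make it easier to follow).
In the next step, grouped by the same columns, we check whether all elements of 'groupstatus' are 3 (all(groupstatus==3)). If so, we take the logical index of the rows with the maximum 'groupsize' (groupsize==max(groupsize)). Otherwise (i.e. if some values in 'groupstatus' are not 3), we take the rows where 'highest' is 0 for every row in that country-year (!any(highest)) and 'groupstatus' is 3 (groupstatus==3). Wrapping the logical vector in .I converts it to numeric row indices. We extract the row index column ($V1) and use it to set 'highest' to 1 for those rows.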
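library(data.table)
# Step 1: flag groups whose status is 1 or 2
setDT(D1)[, highest := +(groupstatus %in% 1:2), by = .(country, year)]
# Step 2: row indices of the status-3 groups that should be flagged
indx <- D1[, .I[if (all(groupstatus == 3)) groupsize == max(groupsize)
                else !any(highest) & groupstatus == 3], by = .(country, year)]$V1
# Step 3: set 'highest' to 1 on those rows
D1[indx, highest := 1L]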
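One thing to watch for in the code above: the else branch flags every status-3 row when no group has status 1 or 2, so a country-year with statuses 3, 3 and 4 would end up with two groups flagged. The example data has no such case. As a hedged alternative, here is a minimal base R sketch that states the rule more generally: within each country-year, take the best (lowest-numbered) status and break ties by the largest 'groupsize'. It assumes 'groupstatus' is stored as a number rather than a string, and it applies the size tie-break to any status, which goes beyond what the question asks. The names key, best_status, is_best, size_if_best and top_size are only illustrative.

```r
# Rebuild the example data from the question (status stored as numbers here).
D1 <- data.frame(
  country = rep(c("US", "Canada"), each = 10),
  year = rep(1991:1995, times = 4),
  group = rep(c("White", "Latino", "English", "French"), each = 5),
  groupstatus = c(1, 1, 1, 3, 3, 5, 5, 5, 3, 3, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4),
  groupsize = rep(c(0.7, 0.15, 0.1, 0.2), each = 5)
)

# Country-year key used for grouping.
key <- paste(D1$country, D1$year)

# Best (numerically lowest) status within each country-year.
best_status <- ave(D1$groupstatus, key, FUN = min)
is_best <- D1$groupstatus == best_status

# Among the best-status rows, find the largest group size per country-year.
size_if_best <- ifelse(is_best, D1$groupsize, -Inf)
top_size <- ave(size_if_best, key, FUN = max)

# A row is flagged if it has the best status and the largest size among those rows.
D1$highest <- as.integer(is_best & D1$groupsize == top_size)
```

If two groups share both the best status and the same size, this sketch flags both of them; an extra tie-break rule would be needed if exactly one flagged group per country-year is required.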