Categorize groups in panel according to value of a variable in a specific year in R
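I have a panel of different countries in R, and I want to create categories based on the value of a specific variable (here 'var3') in a specific year (here year 3).
An example of what I currently have: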
# create data
test.data = as.data.frame(matrix(rexp(200, rate=.1), ncol=5))
colnames(test.data) = c("year", "country", "var1", "var2", "var3")
test.data$year = rep.int(1:5, 8)
test.data$country = rep(1:8, each=5)
# calculate median, minimum and maximum of 'var3'
median = quantile(x = test.data[test.data$year == 3, 5], probs = c(0.5))
min = min(test.data[test.data$year == 3, 5])
max = max(test.data[test.data$year == 3, 5])
# create category variable based on values of 'var3'
test.data$cat.1 = cut(test.data$var3, c(min, median, max))
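In this case, the value of 'cat.1' depends on the value of 'var3' for the respective observation, but I want it to depend on the value for a given country in a specific year (i.e. I want the same value for all years of a given country). Is there a straightforward way to do this, or do I have to do it manually (select the countries for each group and assign them a value)? Doing it manually is fine if the number of groups is fixed, but it gets cumbersome if you want to try different group sizes.
The current result looks like this: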
year country var1 var2 var3 cat.1
1 1 1 4.4206363 9.32628504 4.0988089 (1.2,6.71]
2 2 1 7.6072491 6.30949828 39.5694414 <NA>
3 3 1 3.3774183 7.94397550 8.8419793 (6.71,22.2]
4 4 1 1.0300372 9.93858310 0.4908481 <NA>
5 5 1 6.4514008 2.10367840 29.6052797 <NA>
6 1 2 8.7609877 5.76332181 17.4117561 (6.71,22.2]
7 2 2 6.1253021 0.17258071 23.9096280 <NA>
8 3 2 48.3335241 1.19255084 3.3644827 (1.2,6.71]
9 4 2 34.1683821 10.98216846 29.0255100 <NA>
10 5 2 15.5824154 2.53484781 16.3466249 (6.71,22.2]
But I want this:
year country var1 var2 var3 cat.1
1 1 1 4.4206363 9.32628504 4.0988089 (6.71,22.2]
2 2 1 7.6072491 6.30949828 39.5694414 (6.71,22.2]
3 3 1 3.3774183 7.94397550 8.8419793 (6.71,22.2]
4 4 1 1.0300372 9.93858310 0.4908481 (6.71,22.2]
5 5 1 6.4514008 2.10367840 29.6052797 (6.71,22.2]
6 1 2 8.7609877 5.76332181 17.4117561 (1.2,6.71]
7 2 2 6.1253021 0.17258071 23.9096280 (1.2,6.71]
8 3 2 48.3335241 1.19255084 3.3644827 (1.2,6.71]
9 4 2 34.1683821 10.98216846 29.0255100 (1.2,6.71]
10 5 2 15.5824154 2.53484781 16.3466249 (1.2,6.71]
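Maybe something along the following lines? This first creates, for each country, a variable holding that country's var3 value in year 3, and then cuts that variable. It should work for many groups, if by groups you mean countries.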
library(dplyr)
out <- test.data %>% group_by(country) %>% mutate(to.cut = var3[year==3] )
out$cat.1 = cut(out$to.cut, c(min, median, max), include.lowest=TRUE)
out
Source: local data frame [40 x 7]
Groups: country [8]
year country var1 var2 var3 cat.1 to.cut
(int) (int) (dbl) (dbl) (dbl) (fctr) (dbl)
1 1 1 2.945957 8.785060 21.820063 (10.3,35.5] 12.06913
2 2 1 1.473719 29.944750 6.915839 (10.3,35.5] 12.06913
3 3 1 8.880734 3.624519 12.069131 (10.3,35.5] 12.06913
4 4 1 31.746000 9.698126 5.929075 (10.3,35.5] 12.06913
5 5 1 34.639945 2.983025 15.438284 (10.3,35.5] 12.06913
6 1 2 16.757240 8.719741 27.412963 (10.3,35.5] 14.74931
7 2 2 1.155467 3.146425 1.730943 (10.3,35.5] 14.74931
8 3 2 1.738710 2.292280 14.749311 (10.3,35.5] 14.74931
9 4 2 13.120079 0.130744 3.000918 (10.3,35.5] 14.74931
10 5 2 27.898422 10.891313 20.912835 (10.3,35.5] 14.74931
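Comment: The numbers obviously differ from your table because we have different random number generator seeds. In your table, the result of cut differs between country 1 and country 2. Since the cut is made across all countries, that difference is most likely due to randomness. If this is not what you expect, please provide a seed that reproduces the original table.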
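For what it's worth, here is a minimal base R sketch of the same idea that fixes a seed so the result is reproducible and lets you vary the number of groups. The seed value and the names ref.year, n.groups, ref and breaks are my own choices for illustration, not something from the question, and the breaks are taken as equally spaced quantiles of the year-3 values, which is an assumption about how the groups should be formed.

```r
# Reproducible version of the example data (the seed value 42 is arbitrary).
set.seed(42)
test.data <- as.data.frame(matrix(rexp(200, rate = 0.1), ncol = 5))
colnames(test.data) <- c("year", "country", "var1", "var2", "var3")
test.data$year <- rep.int(1:5, 8)
test.data$country <- rep(1:8, each = 5)

ref.year <- 3   # year whose var3 value decides the category
n.groups <- 2   # number of categories; 2 reproduces the min/median/max split

# One row per country: its var3 value in the reference year.
ref <- test.data[test.data$year == ref.year, c("country", "var3")]

# Breaks computed from the reference-year values only.
breaks <- quantile(ref$var3, probs = seq(0, 1, length.out = n.groups + 1))

# Classify each country once, then look the class up for every row.
ref$cat.1 <- cut(ref$var3, breaks = breaks, include.lowest = TRUE)
test.data$cat.1 <- ref$cat.1[match(test.data$country, ref$country)]
```

Because each country is classified once and the result is looked up with match(), every year of a country gets the same category, and include.lowest = TRUE keeps the country with the smallest year-3 value from becoming NA. Changing n.groups should be enough to try other group sizes, as long as the resulting quantile breaks are distinct.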