R: fill a matrix by row using matrix row and column names
I have a dataset that looks like this:
set.seed(2)
origin <- rep(c("DEU", "GBR", "ITA", "NLD", "CAN", "MEX", "USA", "CHN", "JPN", "KOR","DEU", "GBR", "ITA", "NLD", "CAN", "MEX", "USA", "CHN", "JPN", "KOR"), 2)
year <- rep(c(1998,1998,1998,1998,1998,1998,1998,1998,1998,1998,2000,2000,2000,2000,2000,2000,2000,2000,2000,2000), 2)
value <- sample(1:10000, size=length(origin), replace=TRUE)
test.df <- as.data.frame(cbind(origin, year, value))
rm(origin, year, value)
Then I have 2 lists.
The first is a list of countries per region, built with the ISOcodes library as follows:
library("ISOcodes")
list.continent <- list(asia = c("Central Asia", "Eastern Asia", "South-eastern Asia", "Southern Asia", "Western Asia"),
africa = c("Northern Africa", "Sub-Saharan Africa", "Eastern Africa", "Middle Africa", "Southern Africa", "Western Africa"),
europe = c("Eastern Europe", "Northern Europe", "Channel Islands", "Southern Europe", "Western Europe"),
oceania = c("Australia and New Zealand", "Melanesia", "Micronesia", "Polynesia"),
northamerica = c("Northern America"),
latinamerica = c("South America", "Central America", "Caribbean"))
country.list.continent <- sapply(list.continent, function(item) {
region <- subset(UN_M.49_Regions, Name %in% item)
sub <- subset(UN_M.49_Countries, Code %in% unlist(strsplit(region$Children, ", ")))
return(sub$ISO_Alpha_3)
}, simplify = FALSE)
rm(list.continent)
And a list of years:
year.list <- levels(as.factor(unique(test.df$year)))
I want to fill a matrix with the computed figure that corresponds to a given region in a given year. The matrix is as follows:
ncol <- length(year.list)
nrow <- length(country.list.continent)
matrix.extraction <- matrix(, nrow = nrow, ncol = ncol)
rownames(matrix.extraction) <- names(country.list.continent)
colnames(matrix.extraction) <- year.list
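To do my calculation, I have a loop that subsets the dataset, which is otherwise too large. The loop runs over the years, i.e. over colnames(matrix.extraction). The idea is to compute, for each year, the share (percentage) of the total value that each country represents. The computation part is simple enough and works well. My problem arises when I need to assign the values to each row.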
for(i in 1:length(colnames(matrix.extraction))){
### I subset and compute what I want
table.temp <- test.df %>%
subset(year == colnames(matrix.extraction)[i]) %>%
group_by(origin) %>%
summarise(value = sum(value, na.rm = TRUE))
table.temp$percent <- prop.table(table.temp$value)
### then I need to attribute the wanted values
matrix.extraction["ROWNAME",i] <- table.temp %>%
subset(origin %in% country.list.continent$"ROWNAME") %>%
summarise(. ,sum = sum(percent)))
}
I really don't know how to do this.
The expected result is a matrix like:
1998 2000
asia here NA
africa NA NA
europe NA NA
oceania NA NA
northamerica NA NA
latinamerica NA NA
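where, instead of "here" in [1,1], the cell holds the sum of the values for every country in the region named by the row, for the year named by the column.
Any help would be appreciated.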
Using a double sapply
We can loop over all combinations of year.list and country.list.continent and compute the sum of value for each combination.
sapply(year.list, function(x) sapply(names(country.list.continent), function(y) {
with(test.df, sum(value[origin %in% country.list.continent[[y]] & year == x]))
}))
# 1998 2000
#asia 21759 20059
#africa 0 0
#europe 39700 35981
#oceania 0 0
#northamerica 21347 17324
#latinamerica 10847 8672
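The question asks for each region's share of the yearly total rather than the raw sum, written into a preallocated matrix by row and column name. The block below is a minimal sketch of that idea, not the answerer's code. It assumes a small hypothetical test.df and a hand-written country.list.continent so that it runs without ISOcodes, and it assumes value is already numeric.

```r
# Hypothetical toy data; the question builds its regions from ISOcodes instead
test.df <- data.frame(
  origin = c("DEU", "GBR", "CAN", "CHN", "DEU", "CAN", "CHN", "JPN"),
  year   = c(1998, 1998, 1998, 1998, 2000, 2000, 2000, 2000),
  value  = c(10, 20, 30, 40, 5, 15, 25, 55)
)
country.list.continent <- list(
  asia         = c("CHN", "JPN"),
  europe       = c("DEU", "GBR"),
  northamerica = c("CAN")
)
year.list <- as.character(sort(unique(test.df$year)))

# Preallocate with dimnames so that cells can be addressed by name
matrix.extraction <- matrix(
  NA_real_,
  nrow = length(country.list.continent), ncol = length(year.list),
  dimnames = list(names(country.list.continent), year.list)
)

for (yr in year.list) {
  in.year <- test.df[test.df$year == yr, ]
  # Each row's share of that year's total value
  share <- in.year$value / sum(in.year$value)
  for (region in names(country.list.continent)) {
    hit <- in.year$origin %in% country.list.continent[[region]]
    # Index by row name and column name instead of by position
    matrix.extraction[region, yr] <- sum(share[hit])
  }
}
matrix.extraction
```

Indexing with matrix.extraction[region, yr] is what replaces the "ROWNAME" placeholder in the question's loop: the inner loop supplies the row name and the outer loop supplies the column name.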
If we are interested in a tidyverse solution:
library(tidyverse)
crossing(x = year.list, y = names(country.list.continent)) %>%
mutate(sum = map2_dbl(x, y, ~
test.df %>%
filter(year == .x & origin %in% country.list.continent[[.y]]) %>%
summarise(total = sum(value)) %>%
pull(total)))
# x y sum
# <chr> <chr> <dbl>
# 1 1998 africa 0
# 2 1998 asia 21759
# 3 1998 europe 39700
# 4 1998 latinamerica 10847
# 5 1998 northamerica 21347
# 6 1998 oceania 0
# 7 2000 africa 0
# 8 2000 asia 20059
# 9 2000 europe 35981
#10 2000 latinamerica 8672
#11 2000 northamerica 17324
#12 2000 oceania 0
You have the numbers stored as factors in test.df, so we need to change them to actual numbers. Run the following before applying the methods above.
test.df[-1] <- lapply(test.df[-1], function(x) as.numeric(as.character(x)))
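To see why this conversion is needed, here is a small sketch with hypothetical toy vectors. It assumes the factor columns come from as.data.frame(cbind(...)) under the pre-4.0 default, which is reproduced here explicitly with stringsAsFactors = TRUE.

```r
# Hypothetical toy vectors standing in for the question's data
origin <- c("DEU", "CAN", "CHN")
year   <- c(1998, 1998, 2000)
value  <- c(10, 20, 30)

# cbind() builds a matrix, and a matrix holds a single type,
# so the numeric columns are coerced to character
m <- cbind(origin, year, value)
is.character(m)

# stringsAsFactors = TRUE mimics the default of R versions before 4.0
bad <- as.data.frame(m, stringsAsFactors = TRUE)
sapply(bad, class)

# as.numeric() on a factor returns the internal level codes, not the labels
codes <- as.numeric(bad$value)

# Going through as.character() first recovers the real numbers
fixed <- bad
fixed[-1] <- lapply(fixed[-1], function(x) as.numeric(as.character(x)))

# data.frame() keeps each column's own type and avoids the problem
good <- data.frame(origin, year, value)
sapply(good, class)
```

Building the data frame with data.frame(origin, year, value) from the start makes the conversion step unnecessary.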
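We can do this in the tidyverse. Convert the named list to a two-column dataset (enframe or stack), then do a full_join with 'test.df' after filtering it to the 'year' values contained in 'year.list', group by 'name' and 'year', get the sum of 'value', and spread to 'wide' format.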
library(tidyverse)
enframe(country.list.continent, value = "origin") %>%
unnest %>%
full_join(test.df %>%
filter(year %in% year.list)) %>%
group_by(name, year) %>%
summarise(value = sum(value, na.rm = TRUE)) %>%
spread(year, value, fill = 0) %>%
select(-4)
# A tibble: 6 x 3
# Groups: name [6]
# name `1998` `2000`
# <chr> <dbl> <dbl>
#1 africa 0 0
#2 asia 33038 18485
#3 europe 36658 35874
#4 latinamerica 14323 14808
#5 northamerica 15697 27405
#6 oceania 0 0
Or in base R, this can be done by converting the list to a two-column data.frame with stack, merging it with 'test.df' after subsetting, and creating a table with xtabs.
xtabs(value ~ ind + year, merge(stack(country.list.continent),
subset(test.df, year %in% year.list), by.x = "values", by.y = "origin"))
# year
#ind 1998 2000
# asia 33038 18485
# africa 0 0
# europe 36658 35874
# oceania 0 0
# northamerica 15697 27405
# latinamerica 14323 14808
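The same stack, merge and xtabs steps can be tried on a small example, with prop.table added to get the per-year shares the question asks about. This is only a sketch: the toy test.df and the hand-written country.list.continent are hypothetical stand-ins for the real data, and the prop.table step is an addition rather than part of the answer above.

```r
# Hypothetical toy data; "africa" has no matching rows on purpose
test.df <- data.frame(
  origin = c("DEU", "GBR", "CAN", "CHN", "DEU", "CAN", "CHN", "JPN"),
  year   = c(1998, 1998, 1998, 1998, 2000, 2000, 2000, 2000),
  value  = c(10, 20, 30, 40, 5, 15, 25, 55)
)
country.list.continent <- list(
  asia         = c("CHN", "JPN"),
  africa       = c("NGA", "EGY"),
  europe       = c("DEU", "GBR"),
  northamerica = c("CAN")
)

# stack() turns the named list into two columns:
# values (country code) and ind (region, stored as a factor)
lookup <- stack(country.list.continent)

# Attach a region to every row of the data
merged <- merge(lookup, test.df, by.x = "values", by.y = "origin")

# Sum value by region and year; regions without data still appear with 0
# because ind keeps all of its factor levels
tab <- xtabs(value ~ ind + year, data = merged)
tab

# Convert each column (year) to shares of that year's total
shares <- prop.table(tab, margin = 2)
shares
```

Cells can then be read by name, for example tab["asia", "1998"], which is the by-name access the question was after.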
Data
test.df <- data.frame(origin, year, value)