How to sum rows by rows?
I'm new to R, and I'm doing a census study as a university project.
For illustration, here is part of my data.frame:
MUN X1990 X1991 X1992 X1993
1 Angra dos Reis (RJ) 11 10 10 10
2 Aperibé (RJ) NA NA NA NA
3 Araruama (RJ) 12040 14589 14231 14231
4 Areal (RJ) NA NA NA 3
5 Armação dos Búzios (RJ) NA NA NA NA
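My problem is that I need to sum the rows of some cities whose names I know or will specify (I don't know in which order they will appear, or whether they will appear at all, across my tables), and the result should be shown in a single row.
For example, I want to add the "Areal" row to the "Angra dos Reis" row and store the result in a newly created row (let's call the result row X).
So the result should be: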
MUN X1990 X1991 X1992 X1993
1 Angra dos Reis (RJ) 11 10 10 10
2 Aperibé (RJ) NA NA NA NA
3 Araruama (RJ) 12040 14589 14231 14231
4 Areal (RJ) NA NA NA 3
5 Armação dos Búzios (RJ) NA NA NA NA
6 X 11 10 10 13
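I have tried writing a for loop with an if statement, but I couldn't get it to work.
This is very similar to Jaap's comment, but a bit more detailed and using row names explicitly: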
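# keep only the year columns and use the municipality names as row names
mat = as.matrix(dat[, 2:5])
row.names(mat) = dat$MUN
# sum the two named rows column by column (ignoring NA) and append the result
mat = rbind(mat, colSums(mat[c("Angra dos Reis (RJ)", "Areal (RJ)"), ], na.rm = TRUE))
row.names(mat)[nrow(mat)] = "X"
mat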
# X1990 X1991 X1992 X1993
# Angra dos Reis (RJ) 11 10 10 10
# Aperibé (RJ) NA NA NA NA
# Araruama (RJ) 12040 14589 14231 14231
# Areal (RJ) NA NA NA 3
# Armação dos Búzios (RJ) NA NA NA NA
# X 11 10 10 13
The result is a matrix; if needed, you can convert it back to a data frame:
dat_result = data.frame(MUN = row.names(mat), mat, row.names = NULL)
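I don't like keeping your data in this format as a data frame. I would either convert it to a matrix (as above) or convert it to long format, e.g. tidyr::gather(dat, key = year, value = value, -MUN), and then do "by group" operations with data.table or dplyr.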
Using this data:
dat = read.table(text = " MUN X1990 X1991 X1992 X1993
1 'Angra dos Reis (RJ)' 11 10 10 10
2 'Aperibé (RJ)' NA NA NA NA
3 'Araruama (RJ)' 12040 14589 14231 14231
4 'Areal (RJ)' NA NA NA 3
5 'Armação dos Búzios (RJ)' NA NA NA NA", header = TRUE)
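The long-format route mentioned above could look roughly like the sketch below. It is only an illustration, not the answerer's code: it assumes the tidyr and dplyr packages are installed, uses tidyr::pivot_longer (the newer counterpart of gather), and the names targets, long, totals and result are made up for the example. The data frame is rebuilt inline so the block runs on its own.

```r
library(dplyr)
library(tidyr)

# same values as the table in the question
dat <- data.frame(
  MUN = c("Angra dos Reis (RJ)", "Aperibé (RJ)", "Araruama (RJ)",
          "Areal (RJ)", "Armação dos Búzios (RJ)"),
  X1990 = c(11, NA, 12040, NA, NA),
  X1991 = c(10, NA, 14589, NA, NA),
  X1992 = c(10, NA, 14231, NA, NA),
  X1993 = c(10, NA, 14231, 3, NA),
  stringsAsFactors = FALSE
)
# hypothetical vector of the municipalities to be summed
targets <- c("Angra dos Reis (RJ)", "Areal (RJ)")

# long format: one row per municipality and year
long <- pivot_longer(dat, cols = -MUN, names_to = "year", values_to = "value")

# "by group" sum: one total per year over the selected municipalities
totals <- long %>%
  filter(MUN %in% targets) %>%
  group_by(year) %>%
  summarise(value = sum(value, na.rm = TRUE), .groups = "drop") %>%
  mutate(MUN = "X")

# append the totals and go back to the wide layout of the question
result <- bind_rows(long, totals) %>%
  pivot_wider(names_from = year, values_from = value)
result
```

Municipalities listed in targets that are missing from the table are simply not matched by filter(), so the same code can be reused across tables where some cities do not appear.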
One solution is to use the sqldf package. If the data frame is named df, you can do something like this:
library(sqldf)
# UNION ALL keeps every original row; a plain UNION would drop duplicate rows
result <- sqldf("SELECT * FROM df UNION ALL
SELECT 'X', SUM(X1990), SUM(X1991), SUM(X1992), SUM(X1993) FROM df
WHERE MUN IN ('Angra dos Reis (RJ)', 'Areal (RJ)')")
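I assume you want to sum the data of two municipalities whose names you know or specify, and then append their sum at the end of the table. I'm not sure this understanding is correct. You may need to clarify your question if the following code isn't what you need (for example, whether you need to sum several cities at a time or only two).
Also, if you need to call the function I suggest many times, or your table is very large, you may want to improve speed, for example by using the data.table package instead of base R (since you said you are a beginner, I stuck to base R).
To meet your requirement of keeping NA values wherever possible, I used the code proposed by Joshua Ulrich in his answer to the question "rowSums but keeping NA values".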
data <- data.frame(MUN = c("Angra dos Reis (RJ)", "Aperibé (RJ)", "Araruama (RJ)", "Areal (RJ)", "Armação dos Búzios (RJ)")
,X1990 = c(11, NA, 12040, NA, NA)
,X1991 = c(10, NA, 14589, NA, NA)
,X1992 = c(10, NA, 14231, NA, NA)
,X1993 = c(10, NA, 14231, 3, NA)
)
sum_rows <- function(df, row1, row2) {
  #get the indices of the two rows to be summed
  #grep returns the positions in a vector at which a pattern matches,
  #here the name of the municipality (fixed = TRUE turns off regex matching)
  index_row1 <- grep(row1, df$MUN, fixed = TRUE)
  index_row2 <- grep(row2, df$MUN, fixed = TRUE)
  #select the two rows of the data.frame that you want to sum
  #on basis of the entry in the MUN column
  #further only select the columns with numbers for the sum operation
  rows <- df[c(index_row1, index_row2), 2:ncol(df)]
  #check if all entries in a single column are NA values
  #if yes then the output for this column is NA
  #if no calculate the column sum, if one entry is NA, ignore it
  #(named totals rather than sum to avoid masking base::sum)
  totals <- ifelse(apply(is.na(rows), 2, all)
                   ,NA
                   ,colSums(rows, na.rm = TRUE)
  )
  #create a name entry for the new MUN column
  #paste0 is used to combine strings
  #in this case it might make sense to create a name
  #that includes the indices of the rows that have been summed instead of only using X as name
  name <- paste0("Sum_R", index_row1, "_R", index_row2)
  #add the row to the original data.frame
  df <- cbind(MUN = c(as.character(df$MUN), name)
              ,rbind(df[, 2:ncol(df)], totals)
  )
  #return the data.frame from the function
  df
}
#sum two rows and replace your data.frame by the new result
data <- sum_rows(data, "Angra dos Reis (RJ)", "Areal (RJ)")
data <- sum_rows(data, "Armação dos Búzios (RJ)", "Areal (RJ)")
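If more than two municipalities need to be summed, or some of the names may be missing from a given table, the same idea can be generalised. The following is only a sketch under those assumptions: sum_named_rows and its label argument are hypothetical names introduced here, it matches names exactly with %in% instead of grep, and it assumes MUN is the first column.

```r
# Hypothetical helper (not part of the answer above): sum any number of
# named rows; names that do not occur in df$MUN are silently skipped.
sum_named_rows <- function(df, names, label = "X") {
  # numeric columns of the matching rows (assumes MUN is column 1)
  rows <- df[df$MUN %in% names, -1, drop = FALSE]
  # columns in which none of the selected rows has a value
  all_na <- colSums(!is.na(rows)) == 0
  # column sums ignoring NA, but keep NA where nothing was observed
  totals <- colSums(rows, na.rm = TRUE)
  totals[all_na] <- NA
  # append the totals as a new row labelled with `label`
  rbind(df, data.frame(MUN = label, as.list(totals), stringsAsFactors = FALSE))
}

# small example with the values from the question
census <- data.frame(
  MUN = c("Angra dos Reis (RJ)", "Aperibé (RJ)", "Araruama (RJ)",
          "Areal (RJ)", "Armação dos Búzios (RJ)"),
  X1990 = c(11, NA, 12040, NA, NA),
  X1991 = c(10, NA, 14589, NA, NA),
  X1992 = c(10, NA, 14231, NA, NA),
  X1993 = c(10, NA, 14231, 3, NA),
  stringsAsFactors = FALSE
)
# "Not In Table (RJ)" is a made-up name to show that absent names are ignored
census_x <- sum_named_rows(census, c("Angra dos Reis (RJ)", "Areal (RJ)", "Not In Table (RJ)"))
census_x
```

Exact matching avoids the case where one municipality name is a substring of another, which a fixed-string grep would match as well.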
Here is a dplyr solution:
library(dplyr)
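df %>%
  # keep only the rows to be summed
  filter(MUN %in% c("Angra dos Reis (RJ)", "Areal (RJ)")) %>%
  # column sums over all numeric columns, ignoring NA
  summarise(across(where(is.numeric), ~ sum(.x, na.rm = TRUE))) %>%
  # label the new row and append it to the original data
  mutate(MUN = "X") %>%
  bind_rows(df, .)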
Result:
MUN X1990 X1991 X1992 X1993
1 Angra dos Reis (RJ) 11 10 10 10
2 Aperibé (RJ) NA NA NA NA
3 Araruama (RJ) 12040 14589 14231 14231
4 Areal (RJ) NA NA NA 3
5 Armação dos Búzios (RJ) NA NA NA NA
6 X 11 10 10 13
Data (from @Gregor, with stringsAsFactors = FALSE added):
df = read.table(text = " MUN X1990 X1991 X1992 X1993
1 'Angra dos Reis (RJ)' 11 10 10 10
2 'Aperibé (RJ)' NA NA NA NA
3 'Araruama (RJ)' 12040 14589 14231 14231
4 'Areal (RJ)' NA NA NA 3
5 'Armação dos Búzios (RJ)' NA NA NA NA", header = TRUE, stringsAsFactors = FALSE)