How to sum rows by rows?

I'm new to R, and I'm doing a census study as a university project. To illustrate, this is part of my data.frame:

             MUN          X1990  X1991  X1992 X1993
1     Angra dos Reis (RJ)    11    10    10    10
2            Aperibé (RJ)    NA    NA    NA    NA
3           Araruama (RJ)  12040 14589 14231 14231
4              Areal (RJ)    NA    NA    NA     3
5 Armação dos Búzios (RJ)    NA    NA    NA    NA

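My problem is that I need to sum the rows of certain cities whose names I know or will specify (I don't know the order in which they will appear, or whether they will appear at all, in each of my tables), and the result should be shown in a single row.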

For example, I want to add the "Areal" row to the "Angra dos Reis" row and store the result in a newly created row (let's call the result row X). So the result should be:

             MUN          X1990  X1991  X1992 X1993
1     Angra dos Reis (RJ)    11    10    10    10
2            Aperibé (RJ)    NA    NA    NA    NA
3           Araruama (RJ)  12040 14589 14231 14231
4              Areal (RJ)    NA    NA    NA     3
5 Armação dos Búzios (RJ)    NA    NA    NA    NA
6          X                 11    10    10    13

I have already tried writing a for loop with an if statement, but I couldn't get it to work.

This is very similar to Jaap's comment, but a bit more detailed, and it uses the row names explicitly:

mat = as.matrix(dat[, 2:5])
row.names(mat) = dat$MUN
mat = rbind(mat, colSums(mat[c("Angra dos Reis (RJ)", "Areal (RJ)"), ], na.rm = TRUE))
row.names(mat)[nrow(mat)] = "X"
mat
#                         X1990 X1991 X1992 X1993
# Angra dos Reis (RJ)        11    10    10    10
# Aperibé (RJ)               NA    NA    NA    NA
# Araruama (RJ)           12040 14589 14231 14231
# Areal (RJ)                 NA    NA    NA     3
# Armação dos Búzios (RJ)    NA    NA    NA    NA
# X                          11    10    10    13

The result is a matrix. If needed, you can convert it back to a data frame:

dat_result = data.frame(MUN = row.names(mat), mat, row.names = NULL)
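
The question mentions that some cities may be missing from a given table. Indexing a matrix by a row name that does not exist raises a "subscript out of bounds" error, so one option is to keep only the names that are actually present. The following is a minimal sketch under that assumption; the small `mat` and the absent city "Paraty (RJ)" are made up for illustration and are not part of the original answer.

```r
# Illustrative two-row matrix with the same layout as in the answer above
mat <- matrix(
  c(11, 10, 10, 10,
    NA, NA, NA,  3),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Angra dos Reis (RJ)", "Areal (RJ)"),
                  c("X1990", "X1991", "X1992", "X1993"))
)

# "Paraty (RJ)" is a hypothetical name that is absent from this table
wanted  <- c("Angra dos Reis (RJ)", "Areal (RJ)", "Paraty (RJ)")
present <- intersect(wanted, rownames(mat))

# drop = FALSE keeps a matrix even when only one name is present
total <- colSums(mat[present, , drop = FALSE], na.rm = TRUE)

# A named argument to rbind() becomes the row name of the new row
mat_x <- rbind(mat, X = total)
mat_x
```

Note that with `na.rm = TRUE`, `colSums` returns 0 (not NA) for a column in which every selected value is NA.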

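I don't like your data being stored in this format as a data frame. I would either convert it to a matrix (as above) or convert it to long format, e.g. with tidyr::gather(dat, key = year, value = value, -MUN), and then do "by group" operations with data.table or dplyr.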
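
As a rough sketch of the long-format route, assuming the tidyr and dplyr packages are installed: the `gather` call is the one quoted above (newer tidyr versions offer `pivot_longer` as its successor), while the `filter`/`group_by`/`summarise` steps and the name `totals` are my own illustration rather than code from the answer.

```r
library(tidyr)
library(dplyr)

dat <- data.frame(
  MUN = c("Angra dos Reis (RJ)", "Aperibé (RJ)", "Araruama (RJ)",
          "Areal (RJ)", "Armação dos Búzios (RJ)"),
  X1990 = c(11, NA, 12040, NA, NA),
  X1991 = c(10, NA, 14589, NA, NA),
  X1992 = c(10, NA, 14231, NA, NA),
  X1993 = c(10, NA, 14231, 3, NA),
  stringsAsFactors = FALSE
)

# One row per (city, year) pair
long <- gather(dat, key = year, value = value, -MUN)

# Sum the selected cities within each year
totals <- long %>%
  filter(MUN %in% c("Angra dos Reis (RJ)", "Areal (RJ)")) %>%
  group_by(year) %>%
  summarise(value = sum(value, na.rm = TRUE))

totals
```

In long format the years are values in a column rather than column names, so adding another year does not require any change to the summing code.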


Using this data:

dat = read.table(text = "             MUN          X1990  X1991  X1992 X1993
1     'Angra dos Reis (RJ)'    11    10    10    10
2            'Aperibé (RJ)'    NA    NA    NA    NA
3           'Araruama (RJ)'  12040 14589 14231 14231
4              'Areal (RJ)'    NA    NA    NA     3
5 'Armação dos Búzios (RJ)'    NA    NA    NA    NA", header = TRUE)

One solution is to use the sqldf package. If the data frame is named df, you can do something like the following:

library(sqldf)
result <- sqldf("SELECT * FROM df UNION 
       SELECT 'X', SUM(X1990), SUM(X1991), SUM(X1992), SUM(X1993) FROM df
       WHERE MUN IN ('Angra dos Reis (RJ)', 'Areal (RJ)')")
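
Two details of this kind of query may be worth knowing. UNION removes duplicate rows, whereas UNION ALL keeps every row. In addition, with the default SQLite backend, SUM() skips NULLs (NA in R) and returns NULL when there are no non-NULL values, so a column that is NA for all selected cities stays NA. The following is a sketch with UNION ALL, assuming sqldf is installed and a data frame named `df` with the question's columns; the name `only_areal` is just for illustration. The row order shown is what SQLite typically produces, not something SQL itself guarantees.

```r
library(sqldf)

df <- data.frame(
  MUN = c("Angra dos Reis (RJ)", "Aperibé (RJ)", "Araruama (RJ)",
          "Areal (RJ)", "Armação dos Búzios (RJ)"),
  X1990 = c(11, NA, 12040, NA, NA),
  X1991 = c(10, NA, 14589, NA, NA),
  X1992 = c(10, NA, 14231, NA, NA),
  X1993 = c(10, NA, 14231, 3, NA),
  stringsAsFactors = FALSE
)

# UNION ALL appends the summary row without de-duplicating anything
result <- sqldf("SELECT * FROM df
                 UNION ALL
                 SELECT 'X', SUM(X1990), SUM(X1991), SUM(X1992), SUM(X1993) FROM df
                 WHERE MUN IN ('Angra dos Reis (RJ)', 'Areal (RJ)')")

# A sum over values that are all NA comes back as NA rather than 0
only_areal <- sqldf("SELECT SUM(X1990) AS total FROM df WHERE MUN = 'Areal (RJ)'")
```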

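I assume that you want to sum the data of two municipalities whose names you know or specify, and then append their sum at the end of the table. I'm not sure whether I have understood this correctly. You may need to clarify your question if the following code is not what you need (for example, whether you need to sum several cities at once or only two at a time).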

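Furthermore, if you need to call the function I suggest many times, or if your table is very large, it could be made faster, e.g. by using the data.table package instead of base R (since you said you are a beginner, I stuck with base R).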

To meet your requirement of keeping NA values where possible, I used the code proposed by Joshua Ulrich in his answer to the question "rowSums but keeping NA values".

data <- data.frame(MUN = c("Angra dos Reis (RJ)", "Aperibé (RJ)", "Araruama (RJ)", "Areal (RJ)", "Armação dos Búzios (RJ)")
               ,X1990 = c(11, NA, 12040, NA, NA)
               ,X1991 = c(10, NA, 14589, NA, NA)
               ,X1992 = c(10, NA, 14231, NA, NA)
               ,X1993 = c(10, NA, 14231, 3, NA)
)

sum_rows <- function(df, row1, row2) {

  #get the indices of the two rows to be summed
  #grep returns the position in a vector at which a certain element is stored
  #here the name of the municipality 
  index_row1 <-  grep(row1, df$MUN, fixed=TRUE)
  index_row2 <-  grep(row2, df$MUN, fixed=TRUE)

  #select the two rows of the data.frame that you want to sum
  #on basis of the entry in the MUN column
  #further only select the column with numbers for the sum operation
  #check if all entries in a single column are NA values
  #if yes then the output for this column is NA
  #if no calculate the column sum, if one entry is NA, ignore it
  sum <- ifelse(apply(is.na(df[c(index_row1, index_row2),2:ncol(df)]),2,all)
                      ,NA
                      ,colSums(df[c(index_row1, index_row2),2:ncol(df)],na.rm=TRUE)
               )

  #create a name entry for the new MUN column
  #paste0 is used to combine strings
  #in this case it might make sense to create a name 
  #that includes the indices of the rows that have been summed instead of only using X as name
  name <- paste0("Sum_R",index_row1,"_R" , index_row2)

  #add the row to the original data.frame
  df <-  cbind(MUN = c(as.character(df$MUN), name)
               ,rbind(df[, 2:ncol(df)], sum)
              )

  #return the data.frame from the function
  df

} 

#sum two rows and replace your data.frame by the new result
data <- sum_rows(data, "Angra dos Reis (RJ)", "Areal (RJ)")

data <- sum_rows(data, "Armação dos Búzios (RJ)", "Areal (RJ)")
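
If more than two cities need to be summed at once, or some of the names may be missing from a table, the same idea can be generalized. The following is only a sketch: `sum_cities`, its arguments, and the example object `census` are hypothetical names of mine, not part of the answer above. It matches names exactly with `%in%` (whereas `grep(..., fixed = TRUE)` matches substrings) and keeps NA for columns in which every selected value is NA.

```r
# Hypothetical helper: sums any number of named rows and appends the total
sum_cities <- function(df, cities, label = "X") {
  # Exact name matching; names that are absent from df$MUN are simply ignored
  rows <- df[df$MUN %in% cities, , drop = FALSE]
  num_cols <- names(df)[sapply(df, is.numeric)]

  # Sum each numeric column, returning NA when every selected value is NA
  totals <- lapply(rows[num_cols], function(x) {
    if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)
  })

  # Build the new row and append it to the original data frame
  new_row <- data.frame(MUN = label, totals, stringsAsFactors = FALSE)
  rbind(df, new_row)
}

census <- data.frame(
  MUN = c("Angra dos Reis (RJ)", "Aperibé (RJ)", "Araruama (RJ)",
          "Areal (RJ)", "Armação dos Búzios (RJ)"),
  X1990 = c(11, NA, 12040, NA, NA),
  X1991 = c(10, NA, 14589, NA, NA),
  X1992 = c(10, NA, 14231, NA, NA),
  X1993 = c(10, NA, 14231, 3, NA),
  stringsAsFactors = FALSE
)

# "Not in table (RJ)" is a made-up name that does not occur in the data
res <- sum_cities(census, c("Angra dos Reis (RJ)", "Areal (RJ)", "Not in table (RJ)"))
res
```

The function assumes that `MUN` is a character column; with a factor column it may be safer to convert it with `as.character` first.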

Here is a dplyr solution:

library(dplyr)
df %>%
  filter(MUN %in% c("Angra dos Reis (RJ)", "Areal (RJ)")) %>%
  summarize_if(is.numeric, sum, na.rm = TRUE) %>%
  as.list(.) %>%
  c(MUN = "X") %>%
  bind_rows(df, .)

Result:

                      MUN X1990 X1991 X1992 X1993
1     Angra dos Reis (RJ)    11    10    10    10
2            Aperibé (RJ)    NA    NA    NA    NA
3           Araruama (RJ) 12040 14589 14231 14231
4              Areal (RJ)    NA    NA    NA     3
5 Armação dos Búzios (RJ)    NA    NA    NA    NA
6                       X    11    10    10    13
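
In more recent dplyr releases (1.0.0 and later), `summarize_if` is marked as superseded in favour of `across()`. A possible equivalent, as a sketch that assumes such a dplyr version and the same `df` as in this answer (the intermediate names `total` and `result` are mine):

```r
library(dplyr)

df <- data.frame(
  MUN = c("Angra dos Reis (RJ)", "Aperibé (RJ)", "Araruama (RJ)",
          "Areal (RJ)", "Armação dos Búzios (RJ)"),
  X1990 = c(11, NA, 12040, NA, NA),
  X1991 = c(10, NA, 14589, NA, NA),
  X1992 = c(10, NA, 14231, NA, NA),
  X1993 = c(10, NA, 14231, 3, NA),
  stringsAsFactors = FALSE
)

# One-row data frame holding the column sums of the selected cities
total <- df %>%
  filter(MUN %in% c("Angra dos Reis (RJ)", "Areal (RJ)")) %>%
  summarise(across(where(is.numeric), ~ sum(.x, na.rm = TRUE))) %>%
  mutate(MUN = "X")

# bind_rows() matches columns by name, so the column order of total does not matter
result <- bind_rows(df, total)
result
```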

Data (from @Gregor, with stringsAsFactors = FALSE added):

df = read.table(text = "             MUN          X1990  X1991  X1992 X1993
                 1     'Angra dos Reis (RJ)'    11    10    10    10
                 2            'Aperibé (RJ)'    NA    NA    NA    NA
                 3           'Araruama (RJ)'  12040 14589 14231 14231
                 4              'Areal (RJ)'    NA    NA    NA     3
                 5 'Armação dos Búzios (RJ)'    NA    NA    NA    NA", header = TRUE, stringsAsFactors = FALSE)