Sum up table results from multiple sheets into one table in R

Question

I am reading an Excel file that contains multiple sheets.

file_to_read <- "./file_name.xlsx"

# Get the names of all sheets in the file
sheet_names <- readxl::excel_sheets(file_to_read)

# Read the cells of each sheet into a list
L <- lapply(sheet_names, function(x) {
  tidyxl::xlsx_cells(file_to_read, sheets = x)
})

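`L` now holds all the sheets. Next, I need to take the data from each sheet and combine all columns and rows into one file. To be precise, I want to sum the values in matching columns and rows across the sheets and write the result to a single file.

Here is a simple example to illustrate.

Suppose this table is on one sheet: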

df1 <- data.frame(x = 1:5, y = 2:6, z = 3:7)
rownames(df1) <- LETTERS[1:5]
df1
M x y z
A 1 2 3
B 2 3 4
C 3 4 5
D 4 5 6
E 5 6 7

and this second table is on the next sheet:

df2 <- data.frame(x = 1:5, y = 2:6, z = 3:7, w = 8:12)
rownames(df2) <- LETTERS[3:7]
df2
M x y z  w
C 1 2 3  8
D 2 3 4  9
E 3 4 5 10
F 4 5 6 11
G 5 6 7 12

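My goal is to combine (sum) the matching records from all 100 tables in one Excel file, producing one large table that holds the summed value for each record.

The final table should look like this: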

M x  y  z  w
A 1  2  3  0
B 2  3  4  0
C 4  6  8  8
D 6  8 10  9
E 8 10 12 10
F 4  5  6 11
G 5  6  7 12

Is there a way to do this in R? I am not an expert in R, but I would like to know how to read all the sheets, sum them, and then save the output to a file.

Thanks

Answer 1

You can use dplyr and tidyr to get the result you want.

Given

df <- data.frame(subject=c(rep("Mother", 2), rep("Child", 2)), modifier=c("chart2", "child", "tech", "unkn"), mother_chart2=1:4, mother_child=5:8, child_tech=9:12, child_unkn=13:16)
> df
  subject modifier mother_chart2 mother_child child_tech child_unkn
1  Mother   chart2             1            5          9         13
2  Mother    child             2            6         10         14
3   Child     tech             3            7         11         15
4   Child     unkn             4            8         12         16

and

df2 <- data.frame(subject=c(rep("Mother", 2), rep("Child", 2)), modifier=c("chart", "child", "tech", "unkn"), mother_chart=101:104, mother_child=105:108, child_tech=109:112, child_unkn=113:116)

> df2
  subject modifier mother_chart mother_child child_tech child_unkn
1  Mother    chart          101          105        109        113
2  Mother    child          102          106        110        114
3   Child     tech          103          107        111        115
4   Child     unkn          104          108        112        116

then

library(dplyr)
library(tidyr)

df2_tmp <- df2 %>%
  pivot_longer(cols=-c("subject", "modifier"))

df %>%
  pivot_longer(cols=-c("subject", "modifier")) %>%
  full_join(df2_tmp, by=c("subject", "modifier", "name")) %>%
  mutate(across(starts_with("value"), ~ replace_na(., 0)),
         sum = value.x + value.y) %>%
  select(-value.x, -value.y) %>%
  pivot_wider(names_from=name, values_from=sum, values_fill=0)

returns

# A tibble: 5 x 7
  subject modifier mother_chart2 mother_child child_tech child_unkn mother_chart
  <chr>   <chr>            <dbl>        <dbl>      <dbl>      <dbl>        <dbl>
1 Mother  chart2               1            5          9         13            0
2 Mother  child                2          112        120        128          102
3 Child   tech                 3          114        122        130          103
4 Child   unkn                 4          116        124        132          104
5 Mother  chart                0          105        109        113          101
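
The join above sums two tables. With more than two, the same "outer join, treat missing as 0, add" step can be folded over a list with `Reduce()`. The following is only a minimal base R sketch of that idea: the tables `t1` to `t3`, the key columns `key` and `name`, and the helper `sum_pair()` are hypothetical and not part of the code above.

```r
# Three small long-format tables (hypothetical data) keyed by `key` and `name`
t1 <- data.frame(key = c("A", "B"), name = c("x", "x"), value = c(1, 2))
t2 <- data.frame(key = c("B", "C"), name = c("x", "x"), value = c(10, 20))
t3 <- data.frame(key = c("A", "C"), name = c("x", "y"), value = c(100, 200))

# Hypothetical helper: outer-join two long tables on the keys,
# replace values missing on either side with 0, and add them
sum_pair <- function(a, b) {
  m <- merge(a, b, by = c("key", "name"), all = TRUE, suffixes = c(".a", ".b"))
  m$value.a[is.na(m$value.a)] <- 0
  m$value.b[is.na(m$value.b)] <- 0
  data.frame(key = m$key, name = m$name, value = m$value.a + m$value.b)
}

# Fold the pairwise sum over the whole list
total <- Reduce(sum_pair, list(t1, t2, t3))
total
```

Each pass through `Reduce()` joins the running total with the next table, so the cost grows with the number of tables; for a very long list, stacking all tables and grouping once is likely to be simpler.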

Answer 2

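Since you say you have hundreds of sheets, I suggest importing all of them into a list in R, say my.list (as suggested in this link or this readxl documentation), and then following the strategy below rather than binding the data frames two at a time.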

df1 <- read.table(text = 'M x y z
A 1 2 3
B 2 3 4
C 3 4 5
D 4 5 6
E 5 6 7', header = TRUE)
df2 <- read.table(text = 'M x y z  w
C 1 2 3  8
D 2 3 4  9
E 3 4 5 10
F 4 5 6 11
G 5 6 7 12', header = TRUE)

library(tibble)
library(tidyverse)

my.list <- list(df1, df2)

map_dfr(my.list, ~.x)
#>    M x y z  w
#> 1  A 1 2 3 NA
#> 2  B 2 3 4 NA
#> 3  C 3 4 5 NA
#> 4  D 4 5 6 NA
#> 5  E 5 6 7 NA
#> 6  C 1 2 3  8
#> 7  D 2 3 4  9
#> 8  E 3 4 5 10
#> 9  F 4 5 6 11
#> 10 G 5 6 7 12
map_dfr(my.list , ~ .x) %>%
  group_by(M) %>%
  summarise(across(everything(), ~ sum(.x, na.rm = TRUE)))
#> # A tibble: 7 x 5
#>   M         x     y     z     w
#>   <chr> <int> <int> <int> <int>
#> 1 A         1     2     3     0
#> 2 B         2     3     4     0
#> 3 C         4     6     8     8
#> 4 D         6     8    10     9
#> 5 E         8    10    12    10
#> 6 F         4     5     6    11
#> 7 G         5     6     7    12

Created on 2021-05-26 by the reprex package (v2.0.0)
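
If you would rather not load the tidyverse, roughly the same stack-then-sum idea can be sketched in base R. This is an illustrative variant, not the code used above; the names `all_cols`, `padded`, `stacked` and `totals` are made up for the example, and it assumes every table has an `M` key column and otherwise numeric columns.

```r
df1 <- data.frame(M = LETTERS[1:5], x = 1:5, y = 2:6, z = 3:7)
df2 <- data.frame(M = LETTERS[3:7], x = 1:5, y = 2:6, z = 3:7, w = 8:12)
my.list <- list(df1, df2)

# Union of all column names that occur in any table
all_cols <- unique(unlist(lapply(my.list, names)))

# Add each missing column as 0 and put the columns in a common order,
# so that the tables can be stacked with rbind()
padded <- lapply(my.list, function(d) {
  for (nm in setdiff(all_cols, names(d))) d[[nm]] <- 0
  d[all_cols]
})

# Stack all tables, then sum every value column within each M
stacked <- do.call(rbind, padded)
totals <- aggregate(. ~ M, data = stacked, FUN = sum)
totals
```

Padding with 0 before stacking plays the role of `na.rm = TRUE` in the tidyverse version: a record that is absent from a sheet contributes nothing to the sum.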

Answer 3

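One workable approach is the following sequence of steps:

1. Read each sheet into a list
2. Convert each sheet to long format
3. Bind everything into a single data frame
4. Group that long data frame and sum within groups
5. Convert back to tabular format

This should work for N sheets with any combination of row and column headers. For example: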

file <- "D:/Book1.xlsx"
sheet_names <- readxl::excel_sheets(file)
sheet_data <- lapply(sheet_names, function(sheet_name) {
  readxl::read_xlsx(path = file, sheet = sheet_name)
})

# use pivot_longer on each sheet to make long data
long_sheet_data <- lapply(sheet_data, function(data) {
  long <- tidyr::pivot_longer(
    data = data,
    cols = !M,
    names_to = "col",
    values_to = "val"
  )
})

# combine into a single tibble
long_data = dplyr::bind_rows(long_sheet_data)

# sum up matching pairs of `M` and `col`
library(dplyr)  # provides %>%, group_by() and summarise()
summarised <- long_data %>%
  group_by(M, col) %>%
  summarise(agg = sum(val))
  
# convert to a tabular format
tabular <- summarised %>%
  tidyr::pivot_wider(
    names_from = col,
    values_from = agg,
    values_fill = 0
  )

tabular

Using your initial input sheets, I got this output:

> tabular
# A tibble: 7 x 5
# Groups:   M [7]
  M         x     y     z     w
  <chr> <dbl> <dbl> <dbl> <dbl>
1 A         1     2     3     0
2 B         2     3     4     0
3 C         4     6     8     8
4 D         6     8    10     9
5 E         8    10    12    10
6 F         4     5     6    11
7 G         5     6     7    12
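
The question also asks how to save the result to a file. A minimal sketch, assuming a CSV file is acceptable: the data frame below is only a stand-in for `tabular` so that the snippet runs on its own, and `out_path` is a placeholder path, not something from the original code.

```r
# Stand-in for the summed table (values copied from the output above)
tabular <- data.frame(
  M = LETTERS[1:7],
  x = c(1, 2, 4, 6, 8, 4, 5),
  y = c(2, 3, 6, 8, 10, 5, 6),
  z = c(3, 4, 8, 10, 12, 6, 7),
  w = c(0, 0, 8, 9, 10, 11, 12)
)

# Placeholder location; replace with the file you actually want to write
out_path <- file.path(tempdir(), "summed_tables.csv")

# Write without the automatic row-number column
write.csv(tabular, out_path, row.names = FALSE)

# Read the file back to confirm that it round-trips
check <- read.csv(out_path)
check
```

If the output has to be an Excel workbook rather than a CSV, `writexl::write_xlsx()` is one option, provided that package is installed.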
