Pivot dataframe to keep column headings and sub-headings in R

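I am trying to pivot a table that has headings and sub-headings, so that the headings go into a "date" column and the sub-headings become two columns instead of repeating.

Here is an example of my data.

This was produced with dput(). In the original Excel file, each date spans both sub-headings ("blue" and "green"); once in R, these blank cells are renamed X.1, X.2, etc.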

table <- "          X    X.1 X02.Jul.12   X.2 X03.Jul.12   X.3 X04.Jul.12   X.4
1  category number       blue green       blue green       blue green
2         G      1          1     0          1     0          1     0
3         G      2          2    99          2    99          1    99
4         G      3          1     1          1    99          1    99
5         G      4          1     1          1     1          2    99
6         G      5          1     0          1     0          1    99
7         G      6          1    99          1     1          1    99
8         G      7          1     0          1     0          1     0
9         G      8          1     1          1     1          1    99
10        G      9          1     1          1     1          1     1
11        H      1          1     1          1     1          1     1
12        H      2          1    99          1     0          1     0
13        H      3          1     1          1     1          1    99
14        H      4          1    99          1     2          1    99
15        H      5          1     1          1     1          1     1
16        H      6          1     0          1     0          1    99
17        H      7          1     1          2     1          1    99
18        H      8          2     0          2     0          1     1
19        H      9          2     0          2     0          1     1"

#Create a dataframe with the above table
df <- read.table(text=table, header = TRUE)
df
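
The X, X.1, X.2 names are presumably the result of R's default name checking (check.names = TRUE in read.table() and read.csv()), which passes the header through make.names(). As a minimal sketch, assuming the raw header holds an empty string wherever Excel had a merged or blank cell, the renaming can be reproduced directly; raw_header is a hypothetical name used only for illustration:

```r
# Hypothetical raw header: blanks where Excel had empty cells under a merged date.
raw_header <- c("", "", "02.Jul.12", "", "03.Jul.12", "", "04.Jul.12", "")

# make.names() turns "" into "X", prefixes names that start with a digit with
# "X", and (with unique = TRUE) appends .1, .2, ... to duplicates.
checked <- make.names(raw_header, unique = TRUE)
print(checked)
```

This gives the same header as the table above, which is why the date is still recoverable from every second column name.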

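Here is what it looks like in Excel:

Here is the desired output I would like to achieve:

While this could be done manually in Excel, I have multiple files with more than 100 dates/columns, so I would prefer to find a way to clean it up in R.

Any help would be greatly appreciated!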

Excel Reprex

Here is a representation of the dataset as if it had been read from Excel, without corrected names:

# Define the dataset.
df_excel <- structure(
  list(
    c("category", "G", "G", "G", "G", "G", "G", "G", "G", "G", "H", "H", "H", "H", "H", "H", "H", "H", "H"),
    c("number", "1", "2", "3", "4", "5", "6", "7", "8", "9", "1", "2", "3", "4", "5", "6", "7", "8", "9"),
    `02.Jul.12` = c("blue", "1", "2", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "2", "2"),
    c("green", "0", "99", "1", "1", "0", "99", "0", "1", "1", "1", "99", "1", "99", "1", "0", "1", "0", "0"),
    `03.Jul.12` = c("blue", "1", "2", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "2", "2", "2"),
    c("green", "0", "99", "99", "1", "0", "1", "0", "1", "1", "1", "0", "1", "2", "1", "0", "1", "0", "0"),
    `04.Jul.12` = c("blue", "1", "1", "1", "2", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1"),
    c("green", "0", "99", "99", "99", "99", "99", "0", "99", "1", "1", "0", "99", "99", "1", "99", "99", "1", "1")
  ),
  class = "data.frame",
  row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19")
)

# Save dataset in Excel file ('reprex.xlsx') for reproducibility.
openxlsx::write.xlsx(x = df_excel, file = "./reprex.xlsx")

The following code should produce your desired output, although others may have more elegant solutions:

#Create a dataframe with the above table
df <- read.table(text=table, header = TRUE)
df

# Use the first row (category, number, blue, green, ...) as the column names.
names(df) <- unlist(df[1, ])
library(lubridate); library(dplyr); library(tidyr)

# Prefix each blue/green pair with its date. This assumes the dates are
# consecutive, starting on 2 July 2012.
startdate <- dmy("02-Jul-12")
for (i in seq(3, ncol(df), by = 2)){
  names(df)[i:(i+1)] <- paste0(startdate, ":", names(df)[i:(i+1)])
  startdate <- startdate+1
}

df.tdy <- df[-1,] %>%
  pivot_longer(-c("category","number"), names_to = "datecol", values_to = "value") %>%
  separate(datecol, c("date","color"), sep = ":") %>%
  pivot_wider(names_from = "color") %>%
  arrange(date,category,number)

#    category number date       blue  green
#    <chr>    <chr>  <chr>      <chr> <chr>
#  1 G        1      2012-07-02 1     0
#  2 G        2      2012-07-02 2     99
#  3 G        3      2012-07-02 1     1
#  4 G        4      2012-07-02 1     1
#  5 G        5      2012-07-02 1     0
#  6 G        6      2012-07-02 1     99
#  7 G        7      2012-07-02 1     0
#  8 G        8      2012-07-02 1     1
#  9 G        9      2012-07-02 1     1
# 10 H        1      2012-07-02 1     1
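
The loop above increments startdate by one day per pair of columns, so it relies on the dates being consecutive. If a file might skip days, one alternative is to parse each date from the column name itself. The following is only a sketch under that assumption: hdr is a hypothetical header in which the third date has been changed to 05 Jul to create a gap, and dmy() is assumed to run in a locale with English month names.

```r
library(lubridate)

# Header as read.table() would produce it, with a hypothetical gap in the dates.
hdr <- c("X", "X.1", "X02.Jul.12", "X.2", "X03.Jul.12", "X.3", "X05.Jul.12", "X.4")

# Every second column from the third onwards carries a date in its name.
date_cols <- seq(3, length(hdr), by = 2)

# Drop the "X" that make.names() prepended, then parse day-month-year.
dates <- dmy(sub("^X", "", hdr[date_cols]))

# Each date labels its own column and the blank-named column to its right.
prefix <- rep(as.character(dates), each = 2)
print(prefix)
```

The prefix vector could then be pasted onto the blue/green names in place of the values generated inside the loop.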

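Here is another option using a combination of base R and tidyverse. I first clean up the column names by copying each date into the empty column name to its right (i.e., the green column), so that every column has a date. Then I paste the header together with the sub-heading, except for the first 2 columns (i.e., category and number). Finally, I remove the first row and pivot to long format, where the date goes into one column and the colors stay in their own columns.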

library(tidyverse)

colnames(df)[seq(2, ncol(df), 2)] <- colnames(df)[seq(1, ncol(df), 2)]

colnames(df) <-
  c(df[1, 1], df[1, 2], paste(sep = '_', colnames(df)[3:ncol(df)], as.character(unlist(df[1, 3:ncol(df)]))))

df %>%
  slice(-1) %>%
  pivot_longer(-c(category, number),
               names_to = c("Date", ".value"),
               names_sep = "_") %>%
  # Parse the dates before sorting so rows order chronologically, not alphabetically.
  mutate(Date = lubridate::dmy(Date)) %>%
  arrange(Date, category, number)

Output

# A tibble: 54 × 5
   category number Date       blue  green
   <chr>    <chr>  <date>     <chr> <chr>
 1 G        1      2012-07-02 1     0    
 2 G        2      2012-07-02 2     99   
 3 G        3      2012-07-02 1     1    
 4 G        4      2012-07-02 1     1    
 5 G        5      2012-07-02 1     0    
 6 G        6      2012-07-02 1     99   
 7 G        7      2012-07-02 1     0    
 8 G        8      2012-07-02 1     1    
 9 G        9      2012-07-02 1     1    
10 H        1      2012-07-02 1     1    
# … with 44 more rows
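
The key step above is the ".value" sentinel in names_to: the part of each column name before the separator becomes a value in Date, while the part after it becomes the name of an output column. A minimal sketch on a made-up two-row table (toy and its values are hypothetical, not taken from the question's data):

```r
library(tidyr)

# Hypothetical wide table whose names already follow the <date>_<color> pattern.
toy <- data.frame(
  id = c("a", "b"),
  `02.Jul.12_blue`  = c(1, 2),
  `02.Jul.12_green` = c(0, 99),
  `03.Jul.12_blue`  = c(1, 2),
  `03.Jul.12_green` = c(0, 1),
  check.names = FALSE
)

# "Date" receives the text before "_"; ".value" says that the text after "_"
# names the output columns (blue, green) rather than filling a single column.
long <- pivot_longer(toy, -id, names_to = c("Date", ".value"), names_sep = "_")
print(long)
```

Each id ends up with one row per date and one column per color, so no second pivot_wider() step is needed.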

Data

df <- structure(
  list(
    c("category", "G", "G", "G", "G", "G", "G", "G", "G", "G", "H", "H", "H", "H", "H", "H", "H", "H", "H"),
    c("number", "1", "2", "3", "4", "5", "6", "7", "8", "9", "1", "2", "3", "4", "5", "6", "7", "8", "9"),
    `02.Jul.12` = c("blue", "1", "2", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "2", "2"),
    c("green", "0", "99", "1", "1", "0", "99", "0", "1", "1", "1", "99", "1", "99", "1", "0", "1", "0", "0"),
    `03.Jul.12` = c("blue", "1", "2", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "2", "2", "2"),
    c("green", "0", "99", "99", "1", "0", "1", "0", "1", "1", "1", "0", "1", "2", "1", "0", "1", "0", "0"),
    `04.Jul.12` = c("blue", "1", "1", "1", "2", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1"),
    c("green", "0", "99", "99", "99", "99", "99", "0", "99", "1", "1", "0", "99", "99", "1", "99", "99", "1", "1")
  ),
  class = "data.frame",
  row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19")
)

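If you have additional columns (e.g., another color), you can adjust how the colnames are replaced. I first create a sequence (a) running from the first date column (i.e., 3) to the last column. From a I then create 2 sequences, where b holds the indices of the empty column names (excluding the first 2 columns) and c holds the indices of the date column names. Next, I repeat the dates twice so that they can replace the 2 empty column names that follow each date (green and red). In the next step, I copy just the first two column names (i.e., category and number) and paste the remaining headers (i.e., the dates) onto the sub-headings. From there, the process is the same as above.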

a <- seq(3, ncol(df2))
b <- a[!(a%%3==0)]
c <- a[(a%%3==0)]

colnames(df2)[b] <- colnames(df2)[sort(rep(c, 2))]

colnames(df2) <-
  c(df2[1, 1], df2[1, 2], paste(sep = '_', colnames(df2)[3:ncol(df2)], as.character(unlist(df2[1, 3:ncol(df2)]))))

df2 %>%
  slice(-1) %>%
  pivot_longer(-c(category, number),
               names_to = c("Date", ".value"),
               names_sep = "_") %>%
  # Parse the dates before sorting so rows order chronologically, not alphabetically.
  mutate(Date = lubridate::dmy(Date)) %>%
  arrange(Date, category, number)

Output

# A tibble: 54 × 6
   category number Date       blue  green red  
   <chr>    <chr>  <date>     <chr> <chr> <chr>
 1 G        1      2012-07-02 1     0     1    
 2 G        2      2012-07-02 2     99    2    
 3 G        3      2012-07-02 1     1     1    
 4 G        4      2012-07-02 1     1     1    
 5 G        5      2012-07-02 1     0     1    
 6 G        6      2012-07-02 1     99    1    
 7 G        7      2012-07-02 1     0     1    
 8 G        8      2012-07-02 1     1     1    
 9 G        9      2012-07-02 1     1     1    
10 H        1      2012-07-02 1     1     1    
# … with 44 more rows
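
The a %% 3 == 0 test above is tied to exactly three sub-columns per date starting at column 3. If the number of sub-columns might vary, a forward fill over the header avoids the hard-coded arithmetic. This is only a sketch: fill_header is a hypothetical helper, and it assumes blank names are empty strings, as in the data below.

```r
# Hypothetical helper: carry each non-blank name forward over the blanks after it.
fill_header <- function(nms) {
  nms[nms == ""] <- NA
  # idx counts how many non-blank names have been seen so far (0 before the first).
  idx <- cumsum(!is.na(nms))
  # The leading NA is what positions before the first non-blank name receive.
  c(NA, nms[!is.na(nms)])[idx + 1]
}

hdr <- c("", "", "02.Jul.12", "", "", "03.Jul.12", "", "")
filled <- fill_header(hdr)
print(filled)
```

The first two entries stay NA (category and number have no date), and every later column carries the date of its group, however many sub-columns that group has.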

Data

df2 <- structure(
    list(
      c("category", "G", "G", "G", "G", "G", "G", "G", "G", "G", "H", "H", "H", "H", "H", "H", "H", "H", "H"),
      c("number", "1", "2", "3", "4", "5", "6", "7", "8", "9", "1", "2", "3", "4", "5", "6", "7", "8", "9"),
      `02.Jul.12` = c("blue", "1", "2", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "2", "2"),
      c("green", "0", "99", "1", "1", "0", "99", "0", "1", "1", "1", "99", "1", "99", "1", "0", "1", "0", "0"),
      c("red", "1", "2", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "2", "2"),
      `03.Jul.12` = c("blue", "1", "2", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "2", "2", "2"),
      c("green", "0", "99", "99", "1", "0", "1", "0", "1", "1", "1", "0", "1", "2", "1", "0", "1", "0", "0"),
      c("red", "1", "2", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "2", "2"),
      `04.Jul.12` = c("blue", "1", "1", "1", "2", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1"),
      c("green", "0", "99", "99", "99", "99", "99", "0", "99", "1", "1", "0", "99", "99", "1", "99", "99", "1", "1"),
      c("red", "1", "2", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "2", "2")
    ),
    class = "data.frame",
    row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19")
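  )

# Another tidyverse option: fill the date names forward over the blank column
# names, paste them onto the sub-headings, then pivot.
library(tidyverse)

new_names <- enframe(unlist(df[1, ])) %>%
  mutate(name = na_if(name, "")) %>%
  fill(name) %>%
  transmute(nms = coalesce(str_c(name, value, sep = "_"), value)) %>%
  pull(nms)

df %>%
  set_names(new_names) %>%
  slice(-1) %>%
  type.convert(as.is = TRUE) %>%
  pivot_longer(-c(category, number), names_to = c("Date", ".value"),
               names_sep = "_", names_transform = list(Date = lubridate::dmy)) %>%
  arrange(category, Date, number)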

# A tibble: 54 x 5
   category number Date        blue green
   <chr>     <int> <date>     <int> <int>
 1 G             1 2012-07-02     1     0
 2 G             2 2012-07-02     2    99
 3 G             3 2012-07-02     1     1
 4 G             4 2012-07-02     1     1
 5 G             5 2012-07-02     1     0
 6 G             6 2012-07-02     1    99
 7 G             7 2012-07-02     1     0
 8 G             8 2012-07-02     1     1
 9 G             9 2012-07-02     1     1
10 G             1 2012-07-03     1     0
# ... with 44 more rows

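Here is a tidyverse solution that can handle duplicated column names (such as blue) but does not rely on pasting those names together:

Solution

First load the tidyverse and locate the Excel file: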

# Load the tidyverse.
library(tidyverse)


# Filepath to the Excel file.
filepath <- "reprex.xlsx"

Then read the three relevant parts of the Excel file: the date row (at the top), the header (with duplicated names), and the dataset.

# Extract the date row and fill in the blanks.
dates <- readxl::read_excel(path = filepath, col_names = FALSE, skip = 0, n_max = 1) %>%
  # Convert everything to dates where possible; leave blanks (NAs) elsewhere.
  mutate(across(.cols = everything(), .fns = lubridate::as_datetime)) %>%
  # Treat date row as a column.
  as.double() %>% lubridate::as_datetime() %>% as_tibble() %>%
  # Fill in the blanks with the preceding dates.
  fill(1, .direction = "down") %>%
  # Treat the result as a vector of dates.
  .[[1]]


# Extract the header...
names <- readxl::read_excel(path = filepath, col_names = FALSE, skip = 1, n_max = 1) %>%
  # ...as a vector of column names (with duplicates).
  as.character()


# Extract the (unnamed) dataset.
df <- readxl::read_excel(path = filepath, col_names = FALSE, skip = 2, n_max = Inf)

Finally, use this workflow to properly name and pivot the data.

# Cut out the headers from the data.
df <- df %>%
  # Properly name the dataset.
  set_names(nm = names) %>%
  
  # Pivot the color columns.
  pivot_longer(cols = !c(category, number), names_to = "color") %>%

  # Convert to the proper datatypes.
  mutate(
    category = as.character(category),
    number = as.integer(number),
    value = as.numeric(value)
  ) %>%
  
  # Identify each "clump" of colors by the one row from which it originated;
  # where {'category', 'number'} uniquely identify each such row.
  group_by(category, number) %>%
  # Map the date names to each clump.
  mutate(
    # Index the entries in each clump.
    date = row_number(),
    # Map each date to its corresponding entry.
    date = dates[!is.na(dates)][date],
    # Ensure homogeneity as date objects.
    date = lubridate::as_datetime(date)
  ) %>% ungroup() %>%
  
  # Pivot the colors into consolidated columns: one for each color.
  pivot_wider(names_from = color, values_from = value) %>%
  
  # Sort as desired.
  arrange(date, category, number)

Result

Given a reprex.xlsx like the one you describe here:

when I import my excel .xlsx file instead of a .csv file, the dates become numbers (e.g. 41092)

this solution should yield the following result for df:

# A tibble: 54 x 5
   category number date                 blue green
   <chr>     <int> <dttm>              <dbl> <dbl>
 1 G             1 2012-07-02 00:00:00     1     0
 2 G             2 2012-07-02 00:00:00     2    99
 3 G             3 2012-07-02 00:00:00     1     1
 4 G             4 2012-07-02 00:00:00     1     1
 5 G             5 2012-07-02 00:00:00     1     0
 6 G             6 2012-07-02 00:00:00     1    99
 7 G             7 2012-07-02 00:00:00     1     0
 8 G             8 2012-07-02 00:00:00     1     1
 9 G             9 2012-07-02 00:00:00     1     1
10 H             1 2012-07-02 00:00:00     1     1
11 H             2 2012-07-02 00:00:00     1    99
12 H             3 2012-07-02 00:00:00     1     1
13 H             4 2012-07-02 00:00:00     1    99
14 H             5 2012-07-02 00:00:00     1     1
15 H             6 2012-07-02 00:00:00     1     0
16 H             7 2012-07-02 00:00:00     1     1
17 H             8 2012-07-02 00:00:00     2     0
18 H             9 2012-07-02 00:00:00     2     0
19 G             1 2012-07-03 00:00:00     1     0
20 G             2 2012-07-03 00:00:00     2    99
21 G             3 2012-07-03 00:00:00     1    99
22 G             4 2012-07-03 00:00:00     1     1
23 G             5 2012-07-03 00:00:00     1     0
24 G             6 2012-07-03 00:00:00     1     1
25 G             7 2012-07-03 00:00:00     1     0
26 G             8 2012-07-03 00:00:00     1     1
27 G             9 2012-07-03 00:00:00     1     1
28 H             1 2012-07-03 00:00:00     1     1
29 H             2 2012-07-03 00:00:00     1     0
30 H             3 2012-07-03 00:00:00     1     1
31 H             4 2012-07-03 00:00:00     1     2
32 H             5 2012-07-03 00:00:00     1     1
33 H             6 2012-07-03 00:00:00     1     0
34 H             7 2012-07-03 00:00:00     2     1
35 H             8 2012-07-03 00:00:00     2     0
36 H             9 2012-07-03 00:00:00     2     0
37 G             1 2012-07-04 00:00:00     1     0
38 G             2 2012-07-04 00:00:00     1    99
39 G             3 2012-07-04 00:00:00     1    99
40 G             4 2012-07-04 00:00:00     2    99
41 G             5 2012-07-04 00:00:00     1    99
42 G             6 2012-07-04 00:00:00     1    99
43 G             7 2012-07-04 00:00:00     1     0
44 G             8 2012-07-04 00:00:00     1    99
45 G             9 2012-07-04 00:00:00     1     1
46 H             1 2012-07-04 00:00:00     1     1
47 H             2 2012-07-04 00:00:00     1     0
48 H             3 2012-07-04 00:00:00     1    99
49 H             4 2012-07-04 00:00:00     1    99
50 H             5 2012-07-04 00:00:00     1     1
51 H             6 2012-07-04 00:00:00     1    99
52 H             7 2012-07-04 00:00:00     1    99
53 H             8 2012-07-04 00:00:00     1     1
54 H             9 2012-07-04 00:00:00     1     1

Note

Much like openxlsx::convertToDate() here, the readxl functions automatically convert Excel date numbers into proper R dates.
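
If the dates do arrive as bare serial numbers such as 41092, they can also be converted by hand in base R. A minimal sketch, assuming the workbook uses Excel's default 1900 date system (for which the ?as.Date help page gives the origin "1899-12-30"); serial is a hypothetical variable name:

```r
# Excel serial number quoted in the question.
serial <- 41092

# Under the 1900 date system, day 0 corresponds to 1899-12-30.
converted <- as.Date(serial, origin = "1899-12-30")
print(converted)
```

This recovers 2 July 2012, the first date in the example data. Workbooks saved with the 1904 date system would need a different origin.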

A base R option using reshape

u <- type.convert(setNames(df[-1, ], df[1, ]), as.is = TRUE)
transform(
  reshape(
    cbind(
      u[1:2],
      setNames(
        u[-c(1:2)],
        paste0(
          names(u)[-c(1:2)],
          ".",
          ave(seq(length(u) - 2), names(u)[-c(1:2)], FUN = seq_along)
        )
      )
    ),
    direction = "long",
    idvar = c("category", "number"),
    varying = -c(1:2),
    timevar = "date"
  ),
  date = Filter(nchar, names(df))[date]
)

gives

      category number      date blue green
G.1.1        G      1 02.Jul.12    1     0
G.2.1        G      2 02.Jul.12    2    99
G.3.1        G      3 02.Jul.12    1     1
G.4.1        G      4 02.Jul.12    1     1
G.5.1        G      5 02.Jul.12    1     0
G.6.1        G      6 02.Jul.12    1    99
G.7.1        G      7 02.Jul.12    1     0
G.8.1        G      8 02.Jul.12    1     1
G.9.1        G      9 02.Jul.12    1     1
H.1.1        H      1 02.Jul.12    1     1
H.2.1        H      2 02.Jul.12    1    99
H.3.1        H      3 02.Jul.12    1     1
H.4.1        H      4 02.Jul.12    1    99
H.5.1        H      5 02.Jul.12    1     1
H.6.1        H      6 02.Jul.12    1     0
H.7.1        H      7 02.Jul.12    1     1
H.8.1        H      8 02.Jul.12    2     0
H.9.1        H      9 02.Jul.12    2     0
G.1.2        G      1 03.Jul.12    1     0
G.2.2        G      2 03.Jul.12    2    99
G.3.2        G      3 03.Jul.12    1    99
G.4.2        G      4 03.Jul.12    1     1
G.5.2        G      5 03.Jul.12    1     0
G.6.2        G      6 03.Jul.12    1     1
G.7.2        G      7 03.Jul.12    1     0
G.8.2        G      8 03.Jul.12    1     1
G.9.2        G      9 03.Jul.12    1     1
H.1.2        H      1 03.Jul.12    1     1
H.2.2        H      2 03.Jul.12    1     0
H.3.2        H      3 03.Jul.12    1     1
H.4.2        H      4 03.Jul.12    1     2
H.5.2        H      5 03.Jul.12    1     1
H.6.2        H      6 03.Jul.12    1     0
H.7.2        H      7 03.Jul.12    2     1
H.8.2        H      8 03.Jul.12    2     0
H.9.2        H      9 03.Jul.12    2     0
G.1.3        G      1 04.Jul.12    1     0
G.2.3        G      2 04.Jul.12    1    99
G.3.3        G      3 04.Jul.12    1    99
G.4.3        G      4 04.Jul.12    2    99
G.5.3        G      5 04.Jul.12    1    99
G.6.3        G      6 04.Jul.12    1    99
G.7.3        G      7 04.Jul.12    1     0
G.8.3        G      8 04.Jul.12    1    99
G.9.3        G      9 04.Jul.12    1     1
H.1.3        H      1 04.Jul.12    1     1
H.2.3        H      2 04.Jul.12    1     0
H.3.3        H      3 04.Jul.12    1    99
H.4.3        H      4 04.Jul.12    1    99
H.5.3        H      5 04.Jul.12    1     1
H.6.3        H      6 04.Jul.12    1    99
H.7.3        H      7 04.Jul.12    1    99
H.8.3        H      8 04.Jul.12    1     1
H.9.3        H      9 04.Jul.12    1     1

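Here is another solution using dplyr and tidyr. We first combine the headers and subheaders and then pivot the data frame. We do two pivot operations: first, gather everything into date, name (containing "blue" or "green") and value (containing the corresponding values for "blue" and "green"); then pivot_wider the name and value columns. df comes straight from your Excel reprex.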

library(dplyr)
library(tidyr)

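# Fill each date forward over the blank column names that follow it.
# (vctrs::vec_fill_missing() stands in for the internal tidyr:::fillDown(),
# which is not part of tidyr's public API.)
nms1 <- vctrs::vec_fill_missing(na_if(names(df), ""), direction = "down")
nms2 <- unlist(df[1L, ])
df[-1L, ] %>% 
  setNames(if_else(is.na(nms1), nms2, paste(nms1, nms2, sep = "_"))) %>% 
  pivot_longer(-c(category, number), names_to = c("date", "name"), names_sep = "_") %>% 
  pivot_wider()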

Output

# A tibble: 54 x 5
   category number date      blue  green
   <chr>    <chr>  <chr>     <chr> <chr>
 1 G        1      02.Jul.12 1     0    
 2 G        1      03.Jul.12 1     0    
 3 G        1      04.Jul.12 1     0    
 4 G        2      02.Jul.12 2     99   
 5 G        2      03.Jul.12 2     99   
 6 G        2      04.Jul.12 1     99   
 7 G        3      02.Jul.12 1     1    
 8 G        3      03.Jul.12 1     99   
 9 G        3      04.Jul.12 1     99   
10 G        4      02.Jul.12 1     1    
# ... with 44 more rows