Pivot dataframe to keep column headings and sub-headings in R
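I am trying to pivot a table that has headings and sub-headings, so that the headings go into a "date" column and the sub-headings become two columns instead of repeating.
Here is an example of my data.
This was produced using dput(). In the original Excel file, each date spans the sub-headings ("blue" and "green"); once in R, these blank cells are renamed X.1, X.2, etc.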
table <- " X X.1 X02.Jul.12 X.2 X03.Jul.12 X.3 X04.Jul.12 X.4
1 category number blue green blue green blue green
2 G 1 1 0 1 0 1 0
3 G 2 2 99 2 99 1 99
4 G 3 1 1 1 99 1 99
5 G 4 1 1 1 1 2 99
6 G 5 1 0 1 0 1 99
7 G 6 1 99 1 1 1 99
8 G 7 1 0 1 0 1 0
9 G 8 1 1 1 1 1 99
10 G 9 1 1 1 1 1 1
11 H 1 1 1 1 1 1 1
12 H 2 1 99 1 0 1 0
13 H 3 1 1 1 1 1 99
14 H 4 1 99 1 2 1 99
15 H 5 1 1 1 1 1 1
16 H 6 1 0 1 0 1 99
17 H 7 1 1 2 1 1 99
18 H 8 2 0 2 0 1 1
19 H 9 2 0 2 0 1 1"
#Create a dataframe with the above table
df <- read.table(text=table, header = TRUE)
df
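Here is the example in Excel:
This is the desired output I would like to achieve:
While this could be done manually in Excel, I have multiple files with over 100 dates/columns, so I would prefer to find a way to clean it in R.
Any help would be greatly appreciated!
Excel Reprex
Here is a representation of the dataset, as if it had been read from Excel without any name correction: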
# Define the dataset.
df_excel <- structure(
list(
c("category", "G", "G", "G", "G", "G", "G", "G", "G", "G", "H", "H", "H", "H", "H", "H", "H", "H", "H"),
c("number", "1", "2", "3", "4", "5", "6", "7", "8", "9", "1", "2", "3", "4", "5", "6", "7", "8", "9"),
`02.Jul.12` = c("blue", "1", "2", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "2", "2"),
c("green", "0", "99", "1", "1", "0", "99", "0", "1", "1", "1", "99", "1", "99", "1", "0", "1", "0", "0"),
`03.Jul.12` = c("blue", "1", "2", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "2", "2", "2"),
c("green", "0", "99", "99", "1", "0", "1", "0", "1", "1", "1", "0", "1", "2", "1", "0", "1", "0", "0"),
`04.Jul.12` = c("blue", "1", "1", "1", "2", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1"),
c("green", "0", "99", "99", "99", "99", "99", "0", "99", "1", "1", "0", "99", "99", "1", "99", "99", "1", "1")
),
class = "data.frame",
row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19")
)
# Save dataset in Excel file ('reprex.xlsx') for reproducibility.
openxlsx::write.xlsx(x = df_excel, file = "./reprex.xlsx")
The following code should produce your desired output, though others may have more elegant solutions:
#Create a dataframe with the above table
df <- read.table(text=table, header = TRUE)
df
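names(df) <- unlist(df[1, ])  # promote the sub-heading row to column names
library(dplyr); library(lubridate); library(tidyr)  # dplyr supplies arrange()
# Note: the loop below assumes consecutive daily dates starting on 02-Jul-12.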
startdate <- dmy("02-Jul-12")
for (i in seq(3, ncol(df), by = 2)){
names(df)[i:(i+1)] <- paste0(startdate, ":", names(df)[i:(i+1)])
startdate <- startdate+1
}
df.tdy <- df[-1,] %>% pivot_longer(-c("category","number"), names_to = "datecol", values_to = "value") %>%
separate(datecol, c("date","color"), sep = ":") %>%
pivot_wider(names_from = "color") %>%
arrange(date,category,number)
# category number date blue green
# <chr> <chr> <chr> <chr> <chr>
# 1 G 1 2012-07-02 1 0
# 2 G 2 2012-07-02 2 99
# 3 G 3 2012-07-02 1 1
# 4 G 4 2012-07-02 1 1
# 5 G 5 2012-07-02 1 0
# 6 G 6 2012-07-02 1 99
# 7 G 7 2012-07-02 1 0
# 8 G 8 2012-07-02 1 1
# 9 G 9 2012-07-02 1 1
# 10 H 1 2012-07-02 1 1
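Here is another option using a combination of base R and tidyverse. Here, I first clean up the column names by copying the date into the name of the column to its right (i.e., "green"), so that every column has a date. Then, I combine the header with the sub-heading, except for the first 2 columns (i.e., category and number). Then, I remove the first row and pivot to long format, putting the date in one column and leaving the colors in their own columns.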
library(tidyverse)
colnames(df)[seq(2, ncol(df), 2)] <- colnames(df)[seq(1, ncol(df), 2)]
colnames(df) <-
c(df[1, 1], df[1, 2], paste(sep = '_', colnames(df)[3:ncol(df)], as.character(unlist(df[1, 3:ncol(df)]))))
df %>%
slice(-1) %>%
pivot_longer(-c(category, number),
names_to = c("Date", ".value"),
names_sep = "_") %>%
mutate(Date = lubridate::dmy(Date)) %>%  # parse first so sorting is chronological
arrange(Date, category, number)
Output
# A tibble: 54 × 5
category number Date blue green
<chr> <chr> <date> <chr> <chr>
1 G 1 2012-07-02 1 0
2 G 2 2012-07-02 2 99
3 G 3 2012-07-02 1 1
4 G 4 2012-07-02 1 1
5 G 5 2012-07-02 1 0
6 G 6 2012-07-02 1 99
7 G 7 2012-07-02 1 0
8 G 8 2012-07-02 1 1
9 G 9 2012-07-02 1 1
10 H 1 2012-07-02 1 1
# … with 44 more rows
Data
df <- structure(
list(
c("category", "G", "G", "G", "G", "G", "G", "G", "G", "G", "H", "H", "H", "H", "H", "H", "H", "H", "H"),
c("number", "1", "2", "3", "4", "5", "6", "7", "8", "9", "1", "2", "3", "4", "5", "6", "7", "8", "9"),
`02.Jul.12` = c("blue", "1", "2", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "2", "2"),
c("green", "0", "99", "1", "1", "0", "99", "0", "1", "1", "1", "99", "1", "99", "1", "0", "1", "0", "0"),
`03.Jul.12` = c("blue", "1", "2", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "2", "2", "2"),
c("green", "0", "99", "99", "1", "0", "1", "0", "1", "1", "1", "0", "1", "2", "1", "0", "1", "0", "0"),
`04.Jul.12` = c("blue", "1", "1", "1", "2", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1"),
c("green", "0", "99", "99", "99", "99", "99", "0", "99", "1", "1", "0", "99", "99", "1", "99", "99", "1", "1")
),
class = "data.frame",
row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19")
)
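If you have additional columns (e.g., more colors), you can adjust how colnames are replaced. So, I first create a sequence (a) running from the first date column (i.e., 3) to the last column. Then, I create 2 sequences from a, where b holds the indices of the empty column names (excluding the first 2 columns) and c holds the indices of the date column names. Then, I replicate the dates twice so that they can replace the 2 empty column names for each date (green and red). Then, in the next step, I copy just the first two column names (i.e., category and number) and paste the remaining headers (i.e., the dates) onto the sub-headings. After that, the process is the same as above.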
a <- seq(3, ncol(df2))
b <- a[!(a%%3==0)]
c <- a[(a%%3==0)]
colnames(df2)[b] <- colnames(df2)[sort(rep(c, 2))]
colnames(df2) <-
c(df2[1, 1], df2[1, 2], paste(sep = '_', colnames(df2)[3:ncol(df2)], as.character(unlist(df2[1, 3:ncol(df2)]))))
df2 %>%
slice(-1) %>%
pivot_longer(-c(category, number),
names_to = c("Date", ".value"),
names_sep = "_") %>%
arrange(Date, category, number) %>%
mutate(Date = lubridate::dmy(Date))
Output
# A tibble: 54 × 6
category number Date blue green red
<chr> <chr> <date> <chr> <chr> <chr>
1 G 1 2012-07-02 1 0 1
2 G 2 2012-07-02 2 99 2
3 G 3 2012-07-02 1 1 1
4 G 4 2012-07-02 1 1 1
5 G 5 2012-07-02 1 0 1
6 G 6 2012-07-02 1 99 1
7 G 7 2012-07-02 1 0 1
8 G 8 2012-07-02 1 1 1
9 G 9 2012-07-02 1 1 1
10 H 1 2012-07-02 1 1 1
# … with 44 more rows
Data
df2 <- structure(
list(
c("category", "G", "G", "G", "G", "G", "G", "G", "G", "G", "H", "H", "H", "H", "H", "H", "H", "H", "H"),
c("number", "1", "2", "3", "4", "5", "6", "7", "8", "9", "1", "2", "3", "4", "5", "6", "7", "8", "9"),
`02.Jul.12` = c("blue", "1", "2", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "2", "2"),
c("green", "0", "99", "1", "1", "0", "99", "0", "1", "1", "1", "99", "1", "99", "1", "0", "1", "0", "0"),
c("red", "1", "2", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "2", "2"),
`03.Jul.12` = c("blue", "1", "2", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "2", "2", "2"),
c("green", "0", "99", "99", "1", "0", "1", "0", "1", "1", "1", "0", "1", "2", "1", "0", "1", "0", "0"),
c("red", "1", "2", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "2", "2"),
`04.Jul.12` = c("blue", "1", "1", "1", "2", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1"),
c("green", "0", "99", "99", "99", "99", "99", "0", "99", "1", "1", "0", "99", "99", "1", "99", "99", "1", "1"),
c("red", "1", "2", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "2", "2")
),
class = "data.frame",
row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19")
)
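The modulo-based indexing above works because every date has exactly three sub-columns starting at column 3. As a rough sketch of a more general alternative, the blank header names can be filled forward from the nearest date to their left, whatever the number of sub-headings. The helper name fill_header and the toy vectors top and sub are hypothetical and are not part of the answer above; only base R is used.

```r
# Hypothetical helper (not from the answers above): carry each non-blank
# header forward over the blank names that follow it.
fill_header <- function(nms) {
  blank <- is.na(nms) | nms == ""
  idx <- cumsum(!blank)                      # increments at each non-blank name
  nms[!blank][replace(idx, idx == 0, NA)]    # leading blanks stay NA
}

# Toy header row and sub-heading row mimicking the layout of df2.
top <- c("", "", "02.Jul.12", "", "", "03.Jul.12", "", "")
sub <- c("category", "number", "blue", "green", "red", "blue", "green", "red")

filled <- fill_header(top)
# Keep the sub-heading alone where there is no date; otherwise join date and colour.
new_names <- ifelse(is.na(filled), sub, paste(filled, sub, sep = "_"))
new_names
```

With the real data, top would presumably be colnames(df2) and sub would be unlist(df2[1, ]); the resulting new_names could then be assigned with colnames(df2) <- new_names before the pivot_longer() step.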
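# Another tidyverse option: fill the blank header names forward, then pivot.
library(tidyverse)
library(lubridate)  # for dmy(); not attached by older tidyverse versions
df %>%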
set_names(enframe(unlist(df[1,])) %>%
mutate(name = na_if(name, ''))%>%
fill(name)%>%
transmute(nms = coalesce(str_c(name, value, sep='_'), value)) %>%
pull(nms)) %>%
slice(-1)%>%
type.convert(as.is = TRUE)%>%
pivot_longer(-c(category, number), names_to = c('Date', '.value'),
names_sep = '_', names_transform = list(Date = dmy)) %>%
arrange(category, Date, number)
# A tibble: 54 x 5
category number Date blue green
<chr> <int> <date> <int> <int>
1 G 1 2012-07-02 1 0
2 G 2 2012-07-02 2 99
3 G 3 2012-07-02 1 1
4 G 4 2012-07-02 1 1
5 G 5 2012-07-02 1 0
6 G 6 2012-07-02 1 99
7 G 7 2012-07-02 1 0
8 G 8 2012-07-02 1 1
9 G 9 2012-07-02 1 1
10 G 1 2012-07-03 1 0
# ... with 44 more rows
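Here is a tidyverse solution that can handle duplicate column names (like blue) without relying on string manipulation of those names:
Solution
First load the tidyverse and locate the Excel file: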
# Load the tidyverse.
library(tidyverse)
# Filepath to the Excel file.
filepath <- "reprex.xlsx"
Then read the three relevant parts of the Excel file: the date row (at the very top), the header (with duplicate names), and the dataset.
# Extract the date row and fill in the blanks.
dates <- readxl::read_excel(path = filepath, col_names = FALSE, skip = 0, n_max = 1) %>%
# Convert everything to dates where possible; leave blanks (NAs) elsewhere.
mutate(across(.cols = everything(), .fns = lubridate::as_datetime)) %>%
# Treat date row as a column.
as.double() %>% lubridate::as_datetime() %>% as_tibble() %>%
# Fill in the blanks with the preceding dates.
fill(1, .direction = "down") %>%
# Treat the result as a vector of dates.
.[[1]]
# Extract the header...
names <- readxl::read_excel(path = filepath, col_names = FALSE, skip = 1, n_max = 1) %>%
# ...as a vector of column names (with duplicates).
as.character()
# Extract the (unnamed) dataset.
df <- readxl::read_excel(path = filepath, col_names = FALSE, skip = 2, n_max = Inf)
Finally, use this workflow to properly name and pivot the data.
# Cut out the headers from the data.
df <- df %>%
# Properly name the dataset.
set_names(nm = names) %>%
# Pivot the color columns.
pivot_longer(cols = !c(category, number), names_to = "color") %>%
# Convert to the proper datatypes.
mutate(
category = as.character(category),
number = as.integer(number),
value = as.numeric(value)
) %>%
# Identify each "clump" of colors by the one row from which it originated;
# where {'category', 'number'} uniquely identify each such row.
group_by(category, number) %>%
# Map the date names to each clump.
mutate(
# Index the entries in each clump.
date = row_number(),
# Map each date to its corresponding entry.
date = dates[!is.na(dates)][date],
# Ensure homogeneity as date objects.
date = lubridate::as_datetime(date)
) %>% ungroup() %>%
# Pivot the colors into consolidated columns: one for each color.
pivot_wider(names_from = color, values_from = value) %>%
# Sort as desired.
arrange(date, category, number)
Result
Given a reprex.xlsx like the one you described here:
"when I import my excel .xlsx file instead of a .csv file, the dates become numbers (e.g. 41092)"
this solution should produce the following result for df:
# A tibble: 54 x 5
category number date blue green
<chr> <int> <dttm> <dbl> <dbl>
1 G 1 2012-07-02 00:00:00 1 0
2 G 2 2012-07-02 00:00:00 2 99
3 G 3 2012-07-02 00:00:00 1 1
4 G 4 2012-07-02 00:00:00 1 1
5 G 5 2012-07-02 00:00:00 1 0
6 G 6 2012-07-02 00:00:00 1 99
7 G 7 2012-07-02 00:00:00 1 0
8 G 8 2012-07-02 00:00:00 1 1
9 G 9 2012-07-02 00:00:00 1 1
10 H 1 2012-07-02 00:00:00 1 1
11 H 2 2012-07-02 00:00:00 1 99
12 H 3 2012-07-02 00:00:00 1 1
13 H 4 2012-07-02 00:00:00 1 99
14 H 5 2012-07-02 00:00:00 1 1
15 H 6 2012-07-02 00:00:00 1 0
16 H 7 2012-07-02 00:00:00 1 1
17 H 8 2012-07-02 00:00:00 2 0
18 H 9 2012-07-02 00:00:00 2 0
19 G 1 2012-07-03 00:00:00 1 0
20 G 2 2012-07-03 00:00:00 2 99
21 G 3 2012-07-03 00:00:00 1 99
22 G 4 2012-07-03 00:00:00 1 1
23 G 5 2012-07-03 00:00:00 1 0
24 G 6 2012-07-03 00:00:00 1 1
25 G 7 2012-07-03 00:00:00 1 0
26 G 8 2012-07-03 00:00:00 1 1
27 G 9 2012-07-03 00:00:00 1 1
28 H 1 2012-07-03 00:00:00 1 1
29 H 2 2012-07-03 00:00:00 1 0
30 H 3 2012-07-03 00:00:00 1 1
31 H 4 2012-07-03 00:00:00 1 2
32 H 5 2012-07-03 00:00:00 1 1
33 H 6 2012-07-03 00:00:00 1 0
34 H 7 2012-07-03 00:00:00 2 1
35 H 8 2012-07-03 00:00:00 2 0
36 H 9 2012-07-03 00:00:00 2 0
37 G 1 2012-07-04 00:00:00 1 0
38 G 2 2012-07-04 00:00:00 1 99
39 G 3 2012-07-04 00:00:00 1 99
40 G 4 2012-07-04 00:00:00 2 99
41 G 5 2012-07-04 00:00:00 1 99
42 G 6 2012-07-04 00:00:00 1 99
43 G 7 2012-07-04 00:00:00 1 0
44 G 8 2012-07-04 00:00:00 1 99
45 G 9 2012-07-04 00:00:00 1 1
46 H 1 2012-07-04 00:00:00 1 1
47 H 2 2012-07-04 00:00:00 1 0
48 H 3 2012-07-04 00:00:00 1 99
49 H 4 2012-07-04 00:00:00 1 99
50 H 5 2012-07-04 00:00:00 1 1
51 H 6 2012-07-04 00:00:00 1 99
52 H 7 2012-07-04 00:00:00 1 99
53 H 8 2012-07-04 00:00:00 1 1
54 H 9 2012-07-04 00:00:00 1 1
Note
Much like openxlsx::convertToDate(), the readxl functions automatically convert Excel date numbers into proper R date-times (the <dttm> column shown above).
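If the dates do arrive in R as raw serial numbers such as 41092, they can also be converted by hand in base R. This is a minimal sketch, assuming the workbook uses Excel's default 1900 date system, for which the commonly used origin is 1899-12-30; the variable names are illustrative only.

```r
# Excel serial number taken from the quoted comment in the question.
excel_serial <- 41092

# Assumption: the file uses the 1900 date system (origin 1899-12-30).
converted <- as.Date(excel_serial, origin = "1899-12-30")
format(converted)  # "2012-07-02"
```

The result matches the first date column of the example (02.Jul.12), which is a quick way to check that the assumed origin suits your files.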
A base R option using reshape
u <- type.convert(setNames(df[-1, ], df[1, ]), as.is = TRUE)
transform(
reshape(
cbind(
u[1:2],
setNames(
u[-c(1:2)],
paste0(
names(u)[-c(1:2)],
".",
ave(seq(length(u) - 2), names(u)[-c(1:2)], FUN = seq_along)
)
)
),
direction = "long",
idvar = c("category", "number"),
varying = -c(1:2),
timevar = "date"
),
date = Filter(nchar, names(df))[date]
)
which gives
category number date blue green
G.1.1 G 1 02.Jul.12 1 0
G.2.1 G 2 02.Jul.12 2 99
G.3.1 G 3 02.Jul.12 1 1
G.4.1 G 4 02.Jul.12 1 1
G.5.1 G 5 02.Jul.12 1 0
G.6.1 G 6 02.Jul.12 1 99
G.7.1 G 7 02.Jul.12 1 0
G.8.1 G 8 02.Jul.12 1 1
G.9.1 G 9 02.Jul.12 1 1
H.1.1 H 1 02.Jul.12 1 1
H.2.1 H 2 02.Jul.12 1 99
H.3.1 H 3 02.Jul.12 1 1
H.4.1 H 4 02.Jul.12 1 99
H.5.1 H 5 02.Jul.12 1 1
H.6.1 H 6 02.Jul.12 1 0
H.7.1 H 7 02.Jul.12 1 1
H.8.1 H 8 02.Jul.12 2 0
H.9.1 H 9 02.Jul.12 2 0
G.1.2 G 1 03.Jul.12 1 0
G.2.2 G 2 03.Jul.12 2 99
G.3.2 G 3 03.Jul.12 1 99
G.4.2 G 4 03.Jul.12 1 1
G.5.2 G 5 03.Jul.12 1 0
G.6.2 G 6 03.Jul.12 1 1
G.7.2 G 7 03.Jul.12 1 0
G.8.2 G 8 03.Jul.12 1 1
G.9.2 G 9 03.Jul.12 1 1
H.1.2 H 1 03.Jul.12 1 1
H.2.2 H 2 03.Jul.12 1 0
H.3.2 H 3 03.Jul.12 1 1
H.4.2 H 4 03.Jul.12 1 2
H.5.2 H 5 03.Jul.12 1 1
H.6.2 H 6 03.Jul.12 1 0
H.7.2 H 7 03.Jul.12 2 1
H.8.2 H 8 03.Jul.12 2 0
H.9.2 H 9 03.Jul.12 2 0
G.1.3 G 1 04.Jul.12 1 0
G.2.3 G 2 04.Jul.12 1 99
G.3.3 G 3 04.Jul.12 1 99
G.4.3 G 4 04.Jul.12 2 99
G.5.3 G 5 04.Jul.12 1 99
G.6.3 G 6 04.Jul.12 1 99
G.7.3 G 7 04.Jul.12 1 0
G.8.3 G 8 04.Jul.12 1 99
G.9.3 G 9 04.Jul.12 1 1
H.1.3 H 1 04.Jul.12 1 1
H.2.3 H 2 04.Jul.12 1 0
H.3.3 H 3 04.Jul.12 1 99
H.4.3 H 4 04.Jul.12 1 99
H.5.3 H 5 04.Jul.12 1 1
H.6.3 H 6 04.Jul.12 1 99
H.7.3 H 7 04.Jul.12 1 99
H.8.3 H 8 04.Jul.12 1 1
H.9.3 H 9 04.Jul.12 1 1
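Here is another solution using dplyr and tidyr. We first combine the headers and subheaders, then pivot the data frame. We perform two pivot operations: first we gather everything into date, name (holding "blue" or "green") and value (holding the corresponding values for "blue" and "green"); then we pivot_wider the name and value columns. df comes directly from your Excel reprex.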
library(dplyr)
library(tidyr)
# Note: fillDown() is an unexported tidyr internal and may not exist in every tidyr version.
nms1 <- tidyr:::fillDown(na_if(names(df), ""))
nms2 <- unlist(df[1L, ])
df[-1L, ] %>%
setNames(if_else(is.na(nms1), nms2, paste(nms1, nms2, sep = "_"))) %>%
pivot_longer(-c(category, number), names_to = c("date", "name"), names_sep = "_") %>%
pivot_wider()
Output
# A tibble: 54 x 5
category number date blue green
<chr> <chr> <chr> <chr> <chr>
1 G 1 02.Jul.12 1 0
2 G 1 03.Jul.12 1 0
3 G 1 04.Jul.12 1 0
4 G 2 02.Jul.12 2 99
5 G 2 03.Jul.12 2 99
6 G 2 04.Jul.12 1 99
7 G 3 02.Jul.12 1 1
8 G 3 03.Jul.12 1 99
9 G 3 04.Jul.12 1 99
10 G 4 02.Jul.12 1 1
# ... with 44 more rows
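The result above still holds every column as character, including the date. As a hedged sketch of one possible follow-up, the columns can be converted with base R. The small data frame res below is a stand-in built from the first two output rows, and the format string assumes English month abbreviations in the current locale.

```r
# Stand-in for the first two rows of the output above (all character).
res <- data.frame(
  category = c("G", "G"),
  number   = c("1", "1"),
  date     = c("02.Jul.12", "03.Jul.12"),
  blue     = c("1", "1"),
  green    = c("0", "0"),
  stringsAsFactors = FALSE
)

# Parse day.month-abbreviation.two-digit-year; %b depends on the locale.
res$date <- as.Date(res$date, format = "%d.%b.%y")

# Convert the count columns to integers.
num_cols <- c("number", "blue", "green")
res[num_cols] <- lapply(res[num_cols], as.integer)

str(res)
```

Once date is a real Date, sorting by it is chronological rather than alphabetical, which matters when the files span several months.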