R dates as column names containing duplicate values (need to retain original date)
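I have a dataset that I am trying to tidy. I read the file in with read.xlsx. The header contains date values, and I need to retain those values, even when they are duplicated, when the data is gathered/spread.
The dataset looks like the one below. The dates from Excel are read in as numbers (which is fine). The problem is that there may be duplicate dates (e.g. 43693), and I need to retain their original values.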
Date 43693 43686 43686 43714 43693
1 Contract 111 222 333 444 555
2 Org1 NR NB NR NB P
3 Org2 P P P NB NR
4 Org3 NB NB NB NB P
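I get a duplicate-name error when I try to transform the data.
Ultimately, I am trying to get the data into the shape below, where the date values retain any duplicates (e.g. 43693):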
Date Contract ORG status
1 43693 111 Org1 NR
2 43693 555 Org1 P
3 43686 111 Org2 P
Here is an example df to test with:
df <- structure(
  list(
    Date = c("Contract", "Org1", "Org2", "Org3", "Org4"),
    "12/16/18" = c("111", "pending", "complete", "complete", "pending"),
    "12/16/18" = c("222", "pending", "complete", "pending", "complete"),
    "1/18/18" = c("222", "pending", "complete", "pending", "complete")
  ),
  class = "data.frame",
  .Names = c("Date", "12/16/18", "12/16/18", "1/18/18"),
  row.names = c(NA, -5L)
)
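You have two header rows, which is messy. I would suggest re-reading the data, skipping the date row, and then incorporating the date row into the column names.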
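As a rough illustration of the re-reading idea, here is a minimal base R sketch. It assumes the sheet was read with no header row at all, for example with something like `readxl::read_excel(path, col_names = FALSE)` or `openxlsx::read.xlsx(path, colNames = FALSE)`; the `raw` data frame below is a hand-built stand-in for that result, and the names `raw`, `dates`, `contracts`, `body` and `long` are made up for this example. How the date cells come through depends on the reader you use.

```r
# Stand-in for a sheet read without a header row (values copied from the question)
raw <- data.frame(
  X1 = c("Date", "Contract", "Org1", "Org2", "Org3"),
  X2 = c("43693", "111", "NR", "P", "NB"),
  X3 = c("43686", "222", "NB", "P", "NB"),
  X4 = c("43686", "333", "NR", "P", "NB"),
  X5 = c("43714", "444", "NB", "NB", "NB"),
  X6 = c("43693", "555", "P", "NR", "P"),
  stringsAsFactors = FALSE
)

dates     <- unlist(raw[1, -1], use.names = FALSE)  # first header row
contracts <- unlist(raw[2, -1], use.names = FALSE)  # second header row
body      <- raw[-(1:2), ]                          # the Org rows only

# One output row per (Org, column) pair. The dates are carried as values,
# never as column names, so duplicates cannot collide.
long <- data.frame(
  Date     = rep(dates, each = nrow(body)),
  Contract = rep(contracts, each = nrow(body)),
  ORG      = rep(body[[1]], times = length(dates)),
  status   = unlist(body[-1], use.names = FALSE),
  stringsAsFactors = FALSE
)
```

Because the two header rows never become column names, no renaming step is needed; both 43693 columns end up as ordinary rows in `long`.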
If you have already read the data in, you could try something like this:
library(data.table)
df2 <- setDT(df[-1, ])
setnames(df2, c("Org", paste(names(df), unlist(df[1, ], use.names = FALSE), sep = "_")[-1]))
# Current data
df2
# Org 12/16/18_111 12/16/18_222 1/18/18_222
# 1: Org1 pending pending pending
# 2: Org2 complete complete complete
# 3: Org3 complete pending pending
# 4: Org4 pending complete complete
# melt and split
melt(df2, id.vars="Org")[, c("Date", "Contract") := tstrsplit(variable, "_")][, variable := NULL][]
# Org value Date Contract
# 1: Org1 pending 12/16/18 111
# 2: Org2 complete 12/16/18 111
# 3: Org3 complete 12/16/18 111
# 4: Org4 pending 12/16/18 111
# 5: Org1 pending 12/16/18 222
# 6: Org2 complete 12/16/18 222
# 7: Org3 pending 12/16/18 222
# 8: Org4 complete 12/16/18 222
# 9: Org1 pending 1/18/18 222
# 10: Org2 complete 1/18/18 222
# 11: Org3 pending 1/18/18 222
# 12: Org4 complete 1/18/18 222
If you really want to stick with dplyr and tidyr, here is a translation of the above:
library(dplyr)
library(tidyr)
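# Build unique names of the form <date>_<contract> before any dplyr verb runs
setNames(df, c("Org", paste(names(df), unlist(df[1, ], use.names = FALSE), sep = "_")[-1])) %>%
  slice(-1) %>%                  # drop the Contract row, now part of the names
  pivot_longer(-Org) %>%         # one row per Org and date/contract column
  separate(name, into = c("Date", "Contract"), sep = "_")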
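Note that you have to rename the data set before you start chaining the other commands together.
Indeed, having duplicated column names is a very bad idea. Having dates as column headers also feels problematic. If you have any chance to change the original data to avoid these issues, do so.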
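To make the problem with duplicated names concrete, here is a small base R sketch on a toy data frame (`dup` and the other variable names are made up for this example). Looking a column up by name only ever reaches the first match, which is one reason to detect and repair duplicates early.

```r
# Toy data frame with a repeated column name (check.names = FALSE keeps it)
dup <- data.frame(a = 1:2, a = 3:4, check.names = FALSE)

has_dups   <- anyDuplicated(names(dup)) > 0          # TRUE, "a" appears twice
first_only <- dup[["a"]]                             # returns 1:2, the second "a" is unreachable by name
unique_nms <- make.unique(names(dup), sep = "_")     # "a" "a_1"
```

Note that `make.unique()` changes the names themselves, so if the original values matter (as the dates do here), store them somewhere else first.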
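Here is another approach: read the data in with the duplicated names, save those column names in a row, transpose the data frame, and then turn the previously saved row into a column of the new data frame. Finally, use tidyr's pivot_longer to create a long data frame. Not an elegant solution...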
library(dplyr)
library(tidyr)
# create the data
df <- data.frame(
  Date = c("Contract", "Org1", "Org2", "Org3", "Org4"),
  '12/16/18' = c("111", "pending", "complete", "complete", "pending"),
  '12/16/18' = c("222", "pending", "complete", "pending", "complete"),
  '1/18/18' = c("333", "pending", "complete", "pending", "complete"),
  stringsAsFactors = FALSE,
  check.names = FALSE
)
header <- colnames(df)                          # store the original column names
colnames(df) <- paste0("V", seq_len(ncol(df)))  # rename columns with unique names
df[nrow(df) + 1, ] <- header                    # add the original column names as a row
df2 <- as.data.frame(t(df), stringsAsFactors = FALSE)  # transpose and convert to a data frame
names(df2) <- unlist(df2[1, ], use.names = FALSE)      # first row holds the new column names
df2 <- df2[-1, ]                                # remove the first row
df3 <- df2 %>%                                  # pivot to long shape
  pivot_longer(cols = contains("Org"),
               names_to = "ORG",
               values_to = "Status")
This gives the following output:
> df3
# A tibble: 12 x 4
Contract Date ORG Status
* <chr> <chr> <chr> <chr>
1 111 12/16/18 Org1 pending
2 111 12/16/18 Org2 complete
3 111 12/16/18 Org3 complete
4 111 12/16/18 Org4 pending
5 222 12/16/18 Org1 pending
6 222 12/16/18 Org2 complete
7 222 12/16/18 Org3 pending
8 222 12/16/18 Org4 complete
9 333 1/18/18 Org1 pending
10 333 1/18/18 Org2 complete
11 333 1/18/18 Org3 pending
12 333 1/18/18 Org4 complete
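Once the dates sit in an ordinary column, they can be converted to real Date values if that is ever needed. The sketch below is only an illustration: it assumes the workbook uses Excel's 1900 date system (hence the origin "1899-12-30") and that the text dates in the test df are month/day/two-digit year. The variable names are made up for this example.

```r
# Excel serial numbers as they appear in the question's header
serials <- c(43693, 43686, 43686, 43714, 43693)
excel_dates <- as.Date(serials, origin = "1899-12-30")   # assumes the 1900 date system
format(excel_dates[1])                                    # "2019-08-16"

# Text dates as they appear in the test df (assumed to be %m/%d/%y)
text_dates <- as.Date(c("12/16/18", "12/16/18", "1/18/18"), format = "%m/%d/%y")
```

Duplicates are preserved by the conversion, since each element is converted independently.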