Formatting a dataframe in R
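I have a fairly complicated task to perform, so please bear with me. I'm guessing it is possible, but if it isn't, please let me know.
Suppose I have the following data: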
set.seed(123)
date1 <- c(seq(as.Date("2011-11-1"),as.Date("2012-1-1"),by = "months"),seq(as.Date("2011-12-1"),as.Date("2012-3-1"),by = "months"))
date2 <- c(seq(as.Date("2011-12-1"),as.Date("2012-1-1"),by = "months"),seq(as.Date("2011-11-1"),as.Date("2012-1-1"),by = "months"))
variables <- c(rep("Number of Coins",3),rep("Number of Shoes",4),rep("Number of Coins",2),rep("Number of Shoes",3))
date <- c(date1,date2)
names <- c(rep("Jim",7),rep("Arnold",5))
value <- rnorm(12)
df <- data.frame(names, date, variables, value)
    names       date       variables       value
1     Jim 2011-11-01 Number of Coins -0.56047565
2     Jim 2011-12-01 Number of Coins -0.23017749
3     Jim 2012-01-01 Number of Coins  1.55870831
4     Jim 2011-12-01 Number of Shoes  0.07050839
5     Jim 2012-01-01 Number of Shoes  0.12928774
6     Jim 2012-02-01 Number of Shoes  1.71506499
7     Jim 2012-03-01 Number of Shoes  0.46091621
8  Arnold 2011-12-01 Number of Coins -1.26506123
9  Arnold 2012-01-01 Number of Coins -0.68685285
10 Arnold 2011-11-01 Number of Shoes -0.44566197
11 Arnold 2011-12-01 Number of Shoes  1.22408180
12 Arnold 2012-01-01 Number of Shoes  0.35981383
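The problem with this data is that the variable names occupy a single column. I would like to create two columns, one for Number of Shoes and one for Number of Coins, while making sure the dates stay intact. Ideally, I would like to turn this data frame into this: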
   names    date Number.of.Coins Number.of.Shoes
1    Jim 11/1/11      -0.5604756              NA
2    Jim 12/1/11      -0.2301775      0.07050839
3    Jim  1/1/12       1.5587083      0.12928774
4    Jim  2/1/12              NA      1.71506499
5    Jim  3/1/12              NA      0.46091621
6 Arnold 11/1/11              NA     -0.44566197
7 Arnold 12/1/11      -1.2650612      1.22408180
8 Arnold  1/1/12      -0.6868529      0.35981383
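So, within each name, the date range should run from the earliest date of any variable to the latest date of any variable. Where a variable has no value for a date in that range, the cell should be NA. Hopefully this makes sense!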
As @Ajinkya Kale suggested, you can use the reshape2 package for this task.
library(reshape2)
dcast(df, names + date ~ variables, value.var = "value")
If you want to make sure the dates are in chronological order, you can use arrange() from the dplyr package.
library(dplyr)
arrange(dcast(df, names + date ~ variables, value.var = "value"), names, date)
#   names       date Number of Coins Number of Shoes
#1 Arnold 2011-11-01              NA     -0.44566197
#2 Arnold 2011-12-01      -1.2650612      1.22408180
#3 Arnold 2012-01-01      -0.6868529      0.35981383
#4    Jim 2011-11-01      -0.5604756              NA
#5    Jim 2011-12-01      -0.2301775      0.07050839
#6    Jim 2012-01-01       1.5587083      0.12928774
#7    Jim 2012-02-01              NA      1.71506499
#8    Jim 2012-03-01              NA      0.46091621
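Note that sorting by names puts Arnold before Jim, whereas the desired output in the question lists Jim first. Here is a minimal sketch of one way to keep the names in their order of first appearance, assuming that is the order wanted. The names name_order and wide are illustrative and do not come from the original answer.

```r
library(reshape2)
library(dplyr)

# Rebuild the example data from the question
set.seed(123)
date1 <- c(seq(as.Date("2011-11-1"), as.Date("2012-1-1"), by = "months"),
           seq(as.Date("2011-12-1"), as.Date("2012-3-1"), by = "months"))
date2 <- c(seq(as.Date("2011-12-1"), as.Date("2012-1-1"), by = "months"),
           seq(as.Date("2011-11-1"), as.Date("2012-1-1"), by = "months"))
variables <- c(rep("Number of Coins", 3), rep("Number of Shoes", 4),
               rep("Number of Coins", 2), rep("Number of Shoes", 3))
date <- c(date1, date2)
names <- c(rep("Jim", 7), rep("Arnold", 5))
value <- rnorm(12)
df <- data.frame(names, date, variables, value)

# Order in which each name first appears in the long data (illustrative helper)
name_order <- as.character(unique(df$names))

# Reshape from long to wide: one row per names/date pair, one column per variable
wide <- dcast(df, names + date ~ variables, value.var = "value")

# Sort by position in name_order instead of alphabetically, then by date
wide <- arrange(wide, match(names, name_order), date)
```

Using match() against the stored order avoids converting names to a factor, so the column type in the result stays whatever dcast() returns.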
Another option is to use spread from tidyr.
library(tidyr)
spread(df, variables, value)
#   names       date Number of Coins Number of Shoes
#1 Arnold 2011-11-01              NA     -0.44566197
#2 Arnold 2011-12-01      -1.2650612      1.22408180
#3 Arnold 2012-01-01      -0.6868529      0.35981383
#4    Jim 2011-11-01      -0.5604756              NA
#5    Jim 2011-12-01      -0.2301775      0.07050839
#6    Jim 2012-01-01       1.5587083      0.12928774
#7    Jim 2012-02-01              NA      1.71506499
#8    Jim 2012-03-01              NA      0.46091621
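In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(). Here is a minimal sketch of the equivalent call on the same data, assuming a recent tidyr is installed. The explicit sort and the name wide are illustrative additions, not part of the original answer.

```r
library(tidyr)

# Rebuild the example data from the question
set.seed(123)
date1 <- c(seq(as.Date("2011-11-1"), as.Date("2012-1-1"), by = "months"),
           seq(as.Date("2011-12-1"), as.Date("2012-3-1"), by = "months"))
date2 <- c(seq(as.Date("2011-12-1"), as.Date("2012-1-1"), by = "months"),
           seq(as.Date("2011-11-1"), as.Date("2012-1-1"), by = "months"))
variables <- c(rep("Number of Coins", 3), rep("Number of Shoes", 4),
               rep("Number of Coins", 2), rep("Number of Shoes", 3))
date <- c(date1, date2)
names <- c(rep("Jim", 7), rep("Arnold", 5))
value <- rnorm(12)
df <- data.frame(names, date, variables, value)

# One column per distinct entry in `variables`, filled from `value`;
# names/date pairs with no matching row get NA
wide <- pivot_wider(df, names_from = variables, values_from = value)

# Sort explicitly by name and date so the row order does not depend on the input order
wide <- wide[order(wide$names, wide$date), ]
```

pivot_wider() returns a tibble; wrap the result in as.data.frame() if a plain data frame is needed.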