R Data Transformation
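I have a data frame with three columns that captures transaction data: CustomerName, OrderDate, and the name of the product purchased. I need to transform it into another data frame in which all the items a customer bought on a given date appear in a single row.
Because I am working with a large data set, is there an efficient way to do this transformation, preferably without a for loop?
In addition, the number of product columns in the resulting data frame must equal the maximum number of products purchased by any customer on any single day. Below is an example of the data frame before and after the transformation.
Original data: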
data <- data.frame(Customer = c("John", "John", "John", "Tom", "Tom", "Tom", "Sally", "Sally", "Sally", "Sally"),
OrderDate = c("1-Oct", "2-Oct", "2-Oct", "2-Oct","2-Oct", "2-Oct", "3-Oct", "3-Oct", "3-Oct", "3-Oct"),
Product = c("Milk", "Eggs", "Bread", "Chicken", "Pizza", "Beer", "Salad", "Apples", "Eggs", "Wine"),
stringsAsFactors = FALSE)
# Customer OrderDate Product
# 1 John 1-Oct Milk
# 2 John 2-Oct Eggs
# 3 John 2-Oct Bread
# 4 Tom 2-Oct Chicken
# 5 Tom 2-Oct Pizza
# 6 Tom 2-Oct Beer
# 7 Sally 3-Oct Salad
# 8 Sally 3-Oct Apples
# 9 Sally 3-Oct Eggs
# 10 Sally 3-Oct Wine
After transformation:
datatransform <- as.data.frame(matrix(NA, nrow = 4, ncol = 6))
colnames(datatransform) <- c("Customer", "OrderDate", "Product1", "Product2", "Product3", "Product4")
datatransform$Customer <- c("John", "John", "Tom", "Sally")
datatransform$OrderDate <- c("1-Oct", "2-Oct", "2-Oct", "3-Oct")
datatransform[1, 3:6] <- c("Milk", "", "", "")
datatransform[2, 3:6] <- c("Eggs", "Bread", "", "")
datatransform[3, 3:6] <- c("Chicken", "Pizza", "Beer", "")
datatransform[4, 3:6] <- c("Salad", "Apples", "Eggs", "Wine")
# Customer OrderDate Product1 Product2 Product3 Product4
# 1 John 1-Oct Milk
# 2 John 2-Oct Eggs Bread
# 3 Tom 2-Oct Chicken Pizza Beer
# 4 Sally 3-Oct Salad Apples Eggs Wine
Since you mention a large data set (so efficiency is an important consideration), here is a dplyr and reshape2 solution:
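library(reshape2)
library(dplyr)
data %>%
  group_by(Customer, OrderDate) %>%
  # label each product by its position within the customer/date group
  mutate(ProductValue = paste0("Product", seq_len(n()))) %>%
  # spread the labels into columns: one row per customer/date
  dcast(Customer + OrderDate ~ ProductValue, value.var = "Product") %>%
  arrange(OrderDate)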
#   Customer OrderDate Product1 Product2 Product3 Product4
# 1     John     1-Oct     Milk     <NA>     <NA>     <NA>
# 2     John     2-Oct     Eggs    Bread     <NA>     <NA>
# 3      Tom     2-Oct  Chicken    Pizza     Beer     <NA>
# 4    Sally     3-Oct    Salad   Apples     Eggs     Wine
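Note that the result above has NA in the unused product columns, while the expected output in the question has empty strings. If that matters, or if you would rather not load extra packages, something along these lines may also work in base R. Treat it as a sketch rather than a benchmarked solution: the names idx and wide are just illustrative, and I have not compared its speed against the dplyr/reshape2 version on large data.

```r
# Same sample data as in the question
data <- data.frame(
  Customer  = c("John", "John", "John", "Tom", "Tom", "Tom",
                "Sally", "Sally", "Sally", "Sally"),
  OrderDate = c("1-Oct", "2-Oct", "2-Oct", "2-Oct", "2-Oct", "2-Oct",
                "3-Oct", "3-Oct", "3-Oct", "3-Oct"),
  Product   = c("Milk", "Eggs", "Bread", "Chicken", "Pizza", "Beer",
                "Salad", "Apples", "Eggs", "Wine"),
  stringsAsFactors = FALSE
)

# Position of each product within its Customer/OrderDate group (1, 2, 3, ...)
data$idx <- ave(seq_along(data$Product), data$Customer, data$OrderDate,
                FUN = seq_along)

# Long to wide: one row per Customer/OrderDate, one ProductN column per position
wide <- reshape(data,
                idvar     = c("Customer", "OrderDate"),
                timevar   = "idx",
                direction = "wide",
                sep       = "")

# Replace the NA padding with empty strings, as in the question's expected output
wide[is.na(wide)] <- ""
rownames(wide) <- NULL
```

The number of ProductN columns follows from the largest group, so it equals the maximum number of products any customer bought on a single day without having to compute that number separately.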