R Data Transformation

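I have a data frame with three columns that captures transaction data: CustomerName, OrderDate, and the name of the product purchased. I need to convert it into a data frame in a different format, so that all items purchased by a customer on a single date are in one row.

Since I am working with a large dataset, is there an efficient way to do this transformation, preferably without using for loops?

Also, the number of product columns in the data frame must equal the maximum number of products purchased by any customer on any single day. Below is an example of the data frame before and after the transformation.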

Original data:

data <- data.frame(Customer  = c("John", "John", "John", "Tom", "Tom", "Tom", "Sally", "Sally", "Sally", "Sally"),
                   OrderDate = c("1-Oct", "2-Oct", "2-Oct", "2-Oct","2-Oct", "2-Oct", "3-Oct", "3-Oct", "3-Oct", "3-Oct"),
                   Product   = c("Milk", "Eggs", "Bread", "Chicken", "Pizza", "Beer", "Salad", "Apples", "Eggs", "Wine"),
                   stringsAsFactors = FALSE)

#    Customer OrderDate Product
# 1      John     1-Oct    Milk
# 2      John     2-Oct    Eggs
# 3      John     2-Oct   Bread
# 4       Tom     2-Oct Chicken
# 5       Tom     2-Oct   Pizza
# 6       Tom     2-Oct    Beer
# 7     Sally     3-Oct   Salad
# 8     Sally     3-Oct  Apples
# 9     Sally     3-Oct    Eggs
# 10    Sally     3-Oct    Wine

After transformation:

datatransform <- as.data.frame(matrix(NA, nrow = 4, ncol = 6))
colnames(datatransform) <- c("Customer", "OrderDate", "Product1", "Product2", "Product3", "Product4")
datatransform$Customer <- c("John", "John", "Tom", "Sally")
datatransform$OrderDate <- c("1-Oct", "2-Oct", "2-Oct", "3-Oct")
datatransform[1, 3:6] <- c("Milk", "", "", "") 
datatransform[2, 3:6 ] <- c("Eggs", "Bread", "", "")
datatransform[3, 3:6 ] <- c("Chicken", "Pizza", "Beer", "")
datatransform[4, 3:6 ] <- c("Salad", "Apples", "Eggs", "Wine")

#   Customer OrderDate Product1 Product2 Product3 Product4
# 1     John     1-Oct     Milk                           
# 2     John     2-Oct     Eggs    Bread                  
# 3      Tom     2-Oct  Chicken    Pizza     Beer         
# 4    Sally     3-Oct    Salad   Apples     Eggs     Wine
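
The column-count requirement can be checked before reshaping. The following is a minimal base R sketch on the sample data; the names `items_per_order` and `max_products` are illustrative and not part of the original post.

```r
# Sample data from the question
data <- data.frame(
  Customer  = c("John", "John", "John", "Tom", "Tom", "Tom", "Sally", "Sally", "Sally", "Sally"),
  OrderDate = c("1-Oct", "2-Oct", "2-Oct", "2-Oct", "2-Oct", "2-Oct", "3-Oct", "3-Oct", "3-Oct", "3-Oct"),
  Product   = c("Milk", "Eggs", "Bread", "Chicken", "Pizza", "Beer", "Salad", "Apples", "Eggs", "Wine"),
  stringsAsFactors = FALSE
)

# Count the purchased items in each (Customer, OrderDate) group
items_per_order <- aggregate(Product ~ Customer + OrderDate, data = data, FUN = length)

# The largest order determines how many Product columns are needed
max_products <- max(items_per_order$Product)
max_products
```

For the sample data the largest order is Sally's on 3-Oct, so four product columns are needed.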


Since you mention a large dataset (so efficiency is an important consideration), here is a solution using dplyr and reshape2:

library(reshape2)
library(dplyr)

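data %>%
  group_by(Customer, OrderDate) %>%
  # Number each product within its customer/date group: Product1, Product2, ...
  mutate(ProductValue = paste0("Product", row_number())) %>%
  ungroup() %>%
  # One row per customer/date, one column per product slot
  dcast(Customer + OrderDate ~ ProductValue, value.var = "Product") %>%
  arrange(OrderDate)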

  Customer OrderDate Product1 Product2 Product3 Product4
1     John     1-Oct     Milk     <NA>     <NA>     <NA>
2     John     2-Oct     Eggs    Bread     <NA>     <NA>
3      Tom     2-Oct  Chicken    Pizza     Beer     <NA>
4    Sally     3-Oct    Salad   Apples     Eggs     Wine
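
The output above leaves unused slots as NA, whereas the desired output in the question uses empty strings. As a possible alternative, here is a sketch using tidyr's `pivot_wider` in place of `dcast`; it assumes tidyr 1.1.0 or later, and the names `ProductSlot` and `wide` are illustrative, not from the original answer.

```r
library(dplyr)
library(tidyr)

# Sample data from the question
data <- data.frame(
  Customer  = c("John", "John", "John", "Tom", "Tom", "Tom", "Sally", "Sally", "Sally", "Sally"),
  OrderDate = c("1-Oct", "2-Oct", "2-Oct", "2-Oct", "2-Oct", "2-Oct", "3-Oct", "3-Oct", "3-Oct", "3-Oct"),
  Product   = c("Milk", "Eggs", "Bread", "Chicken", "Pizza", "Beer", "Salad", "Apples", "Eggs", "Wine"),
  stringsAsFactors = FALSE
)

wide <- data %>%
  group_by(Customer, OrderDate) %>%
  # Label each product by its position within the customer/date group
  mutate(ProductSlot = paste0("Product", row_number())) %>%
  ungroup() %>%
  # Spread the slots into columns; fill unused slots with "" instead of NA
  pivot_wider(names_from = ProductSlot, values_from = Product, values_fill = "")

wide
```

`pivot_wider` creates the new columns in order of first appearance, so Product10 would come after Product9. As far as I know, `dcast` orders the new columns alphabetically, which would place Product10 before Product2 once an order has ten or more items.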