Widening a dataframe to get monthly sums of revenue for all unique values of categorical columns in R
I have a df with data like this:
sub = c("X001","X002", "X001","X003","X002","X001","X001","X003","X002","X003","X003","X002")
month = c("201506", "201507", "201506","201507","201507","201508", "201508","201507","201508","201508", "201508", "201508")
tech = c("mobile", "tablet", "PC","mobile","mobile","tablet", "PC","tablet","PC","PC", "mobile", "tablet")
brand = c("apple", "samsung", "dell","apple","samsung","apple", "samsung","dell","samsung","dell", "dell", "dell")
revenue = c(20, 15, 10,25,20,20, 17,9,14,12, 9, 11)
df = data.frame(sub, month, brand, tech, revenue)
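I want to use sub and month as keys and get one row per subscriber per month, showing the sum of revenue for each unique value of tech and brand for that subscriber in that month. This example is simplified and has fewer columns. Because my real dataset is huge, I decided to try data.table.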
I have managed to do this for a single categorical column, either tech or brand, using this:
df1 <- dcast(df, sub + month ~ tech, fun=sum, value.var = "revenue")
But I want to do this for two or more categorical columns, and so far I have tried:
df2 <- dcast(df, sub + month ~ tech+brand, fun=sum, value.var = "revenue")
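This just concatenates the unique values of the categorical columns and sums over each combination, which is not what I want. I want a separate column for each unique value of every categorical column.
I am new to R and would really appreciate any help.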
(I am assuming df is a data.table rather than a data.frame as in your example.)
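If df is still a plain data.frame, as in the question, it needs converting first. A minimal sketch, assuming the data.table package is installed; the data is rebuilt here only so the snippet runs on its own:

```r
library(data.table)

# Example data copied from the question.
df <- data.frame(
  sub     = c("X001", "X002", "X001", "X003", "X002", "X001",
              "X001", "X003", "X002", "X003", "X003", "X002"),
  month   = c("201506", "201507", "201506", "201507", "201507", "201508",
              "201508", "201507", "201508", "201508", "201508", "201508"),
  brand   = c("apple", "samsung", "dell", "apple", "samsung", "apple",
              "samsung", "dell", "samsung", "dell", "dell", "dell"),
  tech    = c("mobile", "tablet", "PC", "mobile", "mobile", "tablet",
              "PC", "tablet", "PC", "PC", "mobile", "tablet"),
  revenue = c(20, 15, 10, 25, 20, 20, 17, 9, 14, 12, 9, 11)
)

# setDT() converts the data.frame to a data.table by reference (no copy).
setDT(df)
```

as.data.table(df) would also work if you prefer to keep the original data.frame untouched.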
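One possible solution is to first melt the data, keeping sub, month and revenue as the id columns. That way, brand and tech are stacked into a single variable column, with a value for each existing combination of the keys. We can then easily dcast back, because we are operating against a single column, just like in your first example: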
dcast(melt(df, c(1:2, 5)), sub + month ~ value, sum, value.var = "revenue")
#     sub  month PC apple dell mobile samsung tablet
# 1: X001 201506 10    20   10     20       0      0
# 2: X001 201508 17    20    0      0      17     20
# 3: X002 201507  0     0    0     20      35     15
# 4: X002 201508 14     0   11      0      14     11
# 5: X003 201507  0    25    9     25       0      9
# 6: X003 201508 12     0   21      9       0      0
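To see why this works, it can help to look at the intermediate long table on its own. The sketch below splits the one-liner into two steps and refers to the id columns by name rather than by position (equivalent here, since columns 1, 2 and 5 are sub, month and revenue). The names long and wide are only for illustration.

```r
library(data.table)

# Example data from the question, built directly as a data.table.
df <- data.table(
  sub     = c("X001", "X002", "X001", "X003", "X002", "X001",
              "X001", "X003", "X002", "X003", "X003", "X002"),
  month   = c("201506", "201507", "201506", "201507", "201507", "201508",
              "201508", "201507", "201508", "201508", "201508", "201508"),
  brand   = c("apple", "samsung", "dell", "apple", "samsung", "apple",
              "samsung", "dell", "samsung", "dell", "dell", "dell"),
  tech    = c("mobile", "tablet", "PC", "mobile", "mobile", "tablet",
              "PC", "tablet", "PC", "PC", "mobile", "tablet"),
  revenue = c(20, 15, 10, 25, 20, 20, 17, 9, 14, 12, 9, 11)
)

# Step 1: stack brand and tech into the columns "variable" (which column the
# entry came from) and "value" (the category itself). Every original row
# appears twice, once per melted column, each time with its full revenue.
long <- melt(df, id.vars = c("sub", "month", "revenue"))

# Inspect the rows for a single subscriber and month.
print(long[sub == "X001" & month == "201506"])

# Step 2: spread the single "value" column back out, summing revenue.
wide <- dcast(long, sub + month ~ value,
              fun.aggregate = sum, value.var = "revenue")
```

Because each row's revenue is counted once under its brand and once under its tech, the brand columns and the tech columns each add up to the subscriber's monthly total.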
Per the OP's comment, you can easily add a prefix by also putting variable in the formula. That way, the columns will also be ordered properly:
dcast(melt(df, c(1:2, 5)), sub + month ~ variable + value, sum, value.var = "revenue")
#     sub  month brand_apple brand_dell brand_samsung tech_PC tech_mobile tech_tablet
# 1: X001 201506          20         10             0      10          20           0
# 2: X001 201508          20          0            17      17           0          20
# 3: X002 201507           0          0            35       0          20          15
# 4: X002 201508           0         11            14      14           0          11
# 5: X003 201507          25          9             0       0          25           9
# 6: X003 201508           0         21             0      12           9           0
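For a wider dataset with more than two categorical columns, the same approach should carry over if you list the columns by name instead of relying on positions. A rough sketch, where cat_cols is a hypothetical helper vector (not part of the original answer) that you would extend with your own column names:

```r
library(data.table)

# Example data from the question, built directly as a data.table.
df <- data.table(
  sub     = c("X001", "X002", "X001", "X003", "X002", "X001",
              "X001", "X003", "X002", "X003", "X003", "X002"),
  month   = c("201506", "201507", "201506", "201507", "201507", "201508",
              "201508", "201507", "201508", "201508", "201508", "201508"),
  brand   = c("apple", "samsung", "dell", "apple", "samsung", "apple",
              "samsung", "dell", "samsung", "dell", "dell", "dell"),
  tech    = c("mobile", "tablet", "PC", "mobile", "mobile", "tablet",
              "PC", "tablet", "PC", "PC", "mobile", "tablet"),
  revenue = c(20, 15, 10, 25, 20, 20, 17, 9, 14, 12, 9, 11)
)

# Hypothetical helper: every categorical column that should be spread out.
cat_cols <- c("brand", "tech")

# Melt only the listed columns, keeping the keys and revenue as id columns.
long <- melt(df, id.vars = c("sub", "month", "revenue"),
             measure.vars = cat_cols)

# One prefixed column per (categorical column, unique value) pair.
wide <- dcast(long, sub + month ~ variable + value,
              fun.aggregate = sum, value.var = "revenue")
```

Naming the columns keeps the call correct if the column order of the real dataset differs from the example.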