Joining/grouping in R

I have 2 datasets like these. The first is Fruits:

ID Apples Oranges Pears
1 0 1 1
2 1 0 0
3 1 1 0
4 0 0 1
5 1 0 0

This dataset shows whether the person with that ID has the fruit (1) or not (0). ID is the primary key here.

The other dataset is Juice. This table records the dates on which each ID made juice. There are no duplicates in this dataset.

ID Dates
1 8/12/2021
1 6/9/2020
2 7/14/2020
2 3/6/2021
2 5/2/2020
3 8/31/2021
5 9/21/2020

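The output I want is how many times each fruit was used. If an ID has more than one fruit, that person is assumed to have used all of their fruits every time they made juice.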

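Going column by column, starting with Apples: ID 2, ID 3 and ID 5 have apples. ID 2 made juice 3 times, ID 3 made juice 1 time, and ID 5 made juice 1 time, so apples were used 5 times (3+1+1). Similarly, ID 1 and ID 3 have oranges. ID 1 made juice 2 times and ID 3 made juice 1 time, so oranges were used 3 times (2+1). ID 1 and ID 4 have pears. ID 1 made juice 2 times and ID 4 made juice 0 times, so pears were used 2 times (2+0).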

Fruit Count
Apples 5
Oranges 3
Pears 2

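I would like to do this in R, Python or SQL, though I think R has the best functions for this problem. I am not sure how to approach it because two tables are involved. Any help would be greatly appreciated.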

R: base

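# merge() has no all.left argument (it is silently ignored); all.x = TRUE gives the left join.
# [-(1:2)] drops the ID and Dates columns; na.rm guards against Juice IDs missing from Fruits.
tmp <- lapply(merge(Juice, Fruits, by = "ID", all.x = TRUE)[-(1:2)], sum, na.rm = TRUE)
data.frame(Fruit = names(tmp), Count = unlist(tmp, use.names = FALSE))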
#     Fruit Count
# 1  Apples     5
# 2 Oranges     3
# 3   Pears     2

R: dplyr

library(dplyr)
library(tidyr) # pivot_longer
Fruits %>%
  pivot_longer(-ID, names_to = "Fruit") %>%
  right_join(Juice, by = "ID") %>%
  filter(value > 0) %>%
  count(Fruit)
# # A tibble: 3 x 2
#   Fruit       n
#   <chr>   <int>
# 1 Apples      5
# 2 Oranges     3
# 3 Pears       2

R: data.table

library(data.table)
JuiceDT <- as.data.table(Juice)    # canonical: setDT(Juice)
FruitsDT <- as.data.table(Fruits)
melt(JuiceDT[FruitsDT, on = .(ID), nomatch=NULL
             ][, lapply(.SD, sum), .SDcols = c("Apples", "Oranges", "Pears")],
     measure.vars = patterns("."),
     variable.name = "Fruit", value.name = "Count")
#      Fruit Count
#     <fctr> <int>
# 1:  Apples     5
# 2: Oranges     3
# 3:   Pears     2

An alternative, more in line with the dplyr solution above:

melt(FruitsDT, id.vars = "ID", variable.name = "Fruit"
  )[JuiceDT, on = .(ID)
  ][, .(Count = sum(value)), by = Fruit]
#      Fruit Count
#     <fctr> <int>
# 1:  Apples     5
# 2: Oranges     3
# 3:   Pears     2

SQL (via R's sqldf)

sqldf::sqldf(
  "with cte as (select * from Juice j left join Fruits f on j.ID=f.ID)
   select 'Apples' as Fruit, sum(Apples) as Count from cte
   union all
   select 'Oranges' as Fruit, sum(Oranges) as Count from cte
   union all
   select 'Pears' as Fruit, sum(Pears) as Count from cte
")
#     Fruit Count
# 1  Apples     5
# 2 Oranges     3
# 3   Pears     2

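This example uses the SQLite engine, which does not support PIVOT. Other sqldf engines may support it, and running raw SQL directly against another DBMS should let you adapt the query more naturally to that dialect.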
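
SQL (via Python's sqlite3)

Since the question also mentions Python, here is a minimal sketch of the same query run as raw SQL through Python's built-in sqlite3 module instead of sqldf. The table layout mirrors the question's data; the explicit column list in the CTE and the final `order by` are my own additions to keep the result unambiguous, not part of the sqldf version above.

```python
import sqlite3

# Same rows as the Fruits and Juice tables in the question.
fruits = [(1, 0, 1, 1), (2, 1, 0, 0), (3, 1, 1, 0), (4, 0, 0, 1), (5, 1, 0, 0)]
juice = [(1, "8/12/2021"), (1, "6/9/2020"), (2, "7/14/2020"), (2, "3/6/2021"),
         (2, "5/2/2020"), (3, "8/31/2021"), (5, "9/21/2020")]

con = sqlite3.connect(":memory:")
con.execute("create table Fruits (ID integer primary key, "
            "Apples integer, Oranges integer, Pears integer)")
con.execute("create table Juice (ID integer, Dates text)")
con.executemany("insert into Fruits values (?, ?, ?, ?)", fruits)
con.executemany("insert into Juice values (?, ?)", juice)

# One row per juice made, carrying the fruit flags of the person who made it;
# summing each flag column then gives the number of times each fruit was used.
query = """
with cte as (
  select f.Apples, f.Oranges, f.Pears
  from Juice j left join Fruits f on j.ID = f.ID
)
select 'Apples' as Fruit, sum(Apples) as Count from cte
union all
select 'Oranges' as Fruit, sum(Oranges) as Count from cte
union all
select 'Pears' as Fruit, sum(Pears) as Count from cte
order by Fruit
"""
sql_counts = con.execute(query).fetchall()
con.close()

print(sql_counts)
# [('Apples', 5), ('Oranges', 3), ('Pears', 2)]
```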


R data

Juice <- structure(list(ID = c(1L, 1L, 2L, 2L, 2L, 3L, 5L), Dates = c("8/12/2021", "6/9/2020", "7/14/2020", "3/6/2021", "5/2/2020", "8/31/2021", "9/21/2020")), class = "data.frame", row.names = c(NA, -7L))
Fruits <- structure(list(ID = 1:5, Apples = c(0L, 1L, 1L, 0L, 1L), Oranges = c(1L, 0L, 1L, 0L, 0L), Pears = c(1L, 0L, 0L, 1L, 0L)), class = "data.frame", row.names = c(NA, -5L))
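
Python: standard library

The question also asks about Python, so here is a rough sketch that needs no third-party packages. It assumes the two tables have already been read into plain Python structures (the `fruits` dict and `juice_ids` list below are hand-written copies of the question's data, and the variable names are only illustrative). The idea is the same as in the R answers: count how many juices each ID made, then add that count to every fruit the ID owns.

```python
from collections import Counter

# Fruits table: ID -> 0/1 flag per fruit.
fruits = {
    1: {"Apples": 0, "Oranges": 1, "Pears": 1},
    2: {"Apples": 1, "Oranges": 0, "Pears": 0},
    3: {"Apples": 1, "Oranges": 1, "Pears": 0},
    4: {"Apples": 0, "Oranges": 0, "Pears": 1},
    5: {"Apples": 1, "Oranges": 0, "Pears": 0},
}

# ID column of the Juice table; the dates are not needed for counting.
juice_ids = [1, 1, 2, 2, 2, 3, 5]

# Number of juices made per ID. A Counter returns 0 for IDs that never
# made juice (ID 4 here), so no special case is needed.
juices_per_id = Counter(juice_ids)

# Each juice made by an ID uses every fruit that ID owns.
counts = Counter()
for person_id, owned in fruits.items():
    for fruit, has_fruit in owned.items():
        counts[fruit] += has_fruit * juices_per_id[person_id]

for fruit, count in counts.items():
    print(fruit, count)
# Apples 5
# Oranges 3
# Pears 2
```

With pandas available, a merge followed by a column-wise sum would probably be the closer analogue of the base R answer; the version above only aims to show the logic.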