How to find the difference (setdiff()) between two multi-category columns of a data.table in R
My data is in two data tables, as shown below (with many more columns than shown here):
Data table 1 = data_sale
Site Id | Country | Product ID |
---|---|---|
1000375476 | Canada | UG10000-WISD |
1000375476 | Canada | UGD12895 |
1000706152 | Switzerland | UG10000-WISD |
1000706152 | Switzerland | UG80000-NTCD-G |
1000797366 | Italy | UG10000-WISD |
1000797366 | Italy | UG12210 |
Data table 2 = data_licenses
Site Id | Country | Product ID |
---|---|---|
1000375476 | Canada | UG10000-WISD |
1000375476 | Canada | UGD12895 |
1000797366 | Italy | UG12785 |
1000797366 | Italy | UG12210 |
I want to calculate the set difference for unique Product ID for all the Site Id in data_sale, keeping all rows.
Here is what I have done so far:
- For both data tables, I created a new column containing all the unique products.
data_sale <-
data_sale[, `unique_products` := paste0(unique(`Product ID`), collapse = ","),
keyby = c("Site Id")]
data_licenses <-
data_licenses[, .(`unique_products` = paste0(unique(`Product ID`), collapse = ",")),
keyby = c("Site Id")]
- Left merge data_sale with data_licenses
merge(data_sale, data_licenses, by = 'Site Id', all.x = TRUE)
The merged data table now looks like this:
Site Id | Country | Product ID | unique_products.data_sale | unique_products.data_licenses |
---|---|---|---|---|
1000375476 | Canada | UG10000-WISD | UG10000-WISD,UGD12895 | UG10000-WISD,UGD12895 |
1000375476 | Canada | UGD12895 | UG10000-WISD,UGD12895 | UG10000-WISD,UGD12895 |
1000706152 | Switzerland | UG10000-WISD | UG10000-WISD,UG80000-NTCD-G | NA |
1000706152 | Switzerland | UG80000-NTCD-G | UG10000-WISD,UG80000-NTCD-G | NA |
1000797366 | Italy | UG10000-WISD | UG10000-WISD,UG12210 | UG12785,UG12210 |
1000797366 | Italy | UG12210 | UG10000-WISD,UG12210 | UG12785,UG12210 |
The problem is with my final step, where I want a new column showing the difference between the products of data_sale and data_licenses. It should look like this:
Site Id | Country | Product ID | unique_products.data_sale | unique_products.data_licenses | difference |
---|---|---|---|---|---|
1000375476 | Canada | UG10000-WISD | UG10000-WISD,UGD12895 | UG10000-WISD,UGD12895 | NA |
1000375476 | Canada | UGD12895 | UG10000-WISD,UGD12895 | UG10000-WISD,UGD12895 | NA |
1000706152 | Switzerland | UG10000-WISD | UG10000-WISD,UG80000-NTCD-G | NA | UG10000-WISD,UG80000-NTCD-G |
1000706152 | Switzerland | UG80000-NTCD-G | UG10000-WISD,UG80000-NTCD-G | NA | UG10000-WISD,UG80000-NTCD-G |
1000797366 | Italy | UG10000-WISD | UG10000-WISD,UG12210 | UG12785,UG12210 | UG10000-WISD |
1000797366 | Italy | UG12210 | UG10000-WISD,UG12210 | UG12785,UG12210 | UG10000-WISD |
Any leads on how to achieve this would be a great help. Thanks!
Below is the dput() output of the merged data table:
structure(list(`Site Id` = c("1000375476", "1000375476", "1000706152",
"1000706152", "1000797366", "1000797366"), Country = c("Canada",
"Canada", "Switzerland", "Switzerland", "Italy", "Italy"), `Product ID` = c("UG10000-WISD",
"UGD12895", "UG10000-WISD", "UG80000-NTCD-G", "UG10000-WISD",
"UG12210"), unique_products.x = c("UG10000-WISD,UGD12895", "UG10000-WISD,UGD12895",
"UG10000-WISD,UG80000-NTCD-G", "UG10000-WISD,UG80000-NTCD-G",
"UG10000-WISD,UG12210", "UG10000-WISD,UG12210"), unique_products.y = c("UG10000-WISD,UGD12895",
"UG10000-WISD,UGD12895", NA, NA, "UG12785,UG12210", "UG12785,UG12210"
)), sorted = "Site Id", class = c("data.table", "data.frame"), row.names = c(NA,
-6L), .internal.selfref = <pointer: 0x556bb5c10a40>)
There may be a way to do this by combining some built-in functions, but here is an example with a simple custom function:
find_differences = function(x,y){
# x: column list of strings we want to compare to
# y: other column list
x = strsplit(x,',') # transform strings to lists
y = strsplit(y,',')
differences = list()
for(i in seq(1,length(x))){ # for every row (nested-list)
if(identical(x[[i]],y[[i]])){
row_diff = NA
}
else{
row_diff = paste(x[[i]][ ! x[[i]] %in% y[[i]] ],collapse=',')
}
differences = c(differences,row_diff)
}
return(differences)
}
Using your example:
library(dplyr) # rename() comes from dplyr
example = rename(example,
unique_products.data_sale = unique_products.x,
unique_products.data_licenses = unique_products.y)
example$difference = find_differences(example$unique_products.data_sale, example$unique_products.data_licenses)
> example
Site Id Country Product ID unique_products.data_sale unique_products.data_licenses difference
1: 1000375476 Canada UG10000-WISD UG10000-WISD,UGD12895 UG10000-WISD,UGD12895 NA
2: 1000375476 Canada UGD12895 UG10000-WISD,UGD12895 UG10000-WISD,UGD12895 NA
3: 1000706152 Switzerland UG10000-WISD UG10000-WISD,UG80000-NTCD-G <NA> UG10000-WISD,UG80000-NTCD-G
4: 1000706152 Switzerland UG80000-NTCD-G UG10000-WISD,UG80000-NTCD-G <NA> UG10000-WISD,UG80000-NTCD-G
5: 1000797366 Italy UG10000-WISD UG10000-WISD,UG12210 UG12785,UG12210 UG10000-WISD
6: 1000797366 Italy UG12210 UG10000-WISD,UG12210 UG12785,UG12210 UG10000-WISD
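The row-by-row loop above can also be written without explicit iteration. The following is a minimal base R sketch, not part of the original answer: `split_setdiff` is a hypothetical helper name, and it assumes the first argument contains no missing values. Unlike `find_differences`, it returns a plain character vector and yields `NA` whenever nothing is left over, not only when the two strings are identical.

```r
# Hypothetical helper (an illustration, not from the original answer):
# element-wise set difference of comma-separated strings, base R only.
# Assumes x has no NA entries; NA entries in y are treated as "no products".
split_setdiff <- function(x, y, sep = ",") {
  xs <- strsplit(x, sep, fixed = TRUE)
  ys <- strsplit(y, sep, fixed = TRUE)  # an NA string stays a single NA
  mapply(function(a, b) {
    b <- b[!is.na(b)]                   # drop the NA placeholder, if any
    d <- setdiff(a, b)                  # products in a that are not in b
    if (length(d) == 0L) NA_character_ else paste(d, collapse = sep)
  }, xs, ys, USE.NAMES = FALSE)
}

sale <- c("UG10000-WISD,UGD12895",
          "UG10000-WISD,UG80000-NTCD-G",
          "UG10000-WISD,UG12210")
lic  <- c("UG10000-WISD,UGD12895", NA, "UG12785,UG12210")

diffs <- split_setdiff(sale, lic)
print(diffs)
```

Because the result is an atomic character vector, it can be assigned directly as an ordinary column, for example `example$difference <- split_setdiff(...)`.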
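This gets the products in data_sale that are not in data_licenses for each Site Id. Rather than concatenating the unique product IDs into one string, it is easier to keep them as character vectors in a list column.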
library(data.table)
data_licenses <- data.table(`Site Id` = c("1000375476", "1000375476", "1000797366", "1000797366"),
Country = c("Canada", "Canada", "Italy", "Italy"),
`Product ID` = c("UG10000-WISD", "UGD12895", "UG12785", "UG12210"))
data_sale <- data.table(`Site Id` = c("1000375476", "1000375476", "1000706152", "1000706152", "1000797366", "1000797366"),
Country = c("Canada", "Canada", "Switzerland", "Switzerland", "Italy", "Italy"),
`Product ID` = c("UG10000-WISD", "UGD12895", "UG10000-WISD", "UG80000-NTCD-G", "UG10000-WISD", "UG12210"))
data_unique <- data_sale[
, .(unique_products.data_sale = .(unique(`Product ID`))), c("Site Id", "Country")
][
data_licenses[, .(unique_products = .(unique(`Product ID`))), "Site Id"],
unique_products.data_licenses := i.unique_products,
on = "Site Id"
][
, difference := lapply(.I, function(i) setdiff(unique_products.data_sale[[i]], unique_products.data_licenses[[i]]))
]
print(data_unique)
#> Site Id Country unique_products.data_sale unique_products.data_licenses difference
#> 1: 1000375476 Canada UG10000-WISD,UGD12895 UG10000-WISD,UGD12895
#> 2: 1000706152 Switzerland UG10000-WISD,UG80000-NTCD-G UG10000-WISD,UG80000-NTCD-G
#> 3: 1000797366 Italy UG10000-WISD,UG12210 UG12785,UG12210 UG10000-WISD
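The result above has one row per site, whereas the question asks to keep every row of data_sale. One possible way to get there is sketched below using a different technique from the answer: an anti-join on the (Site Id, Product ID) pair followed by an update join. The intermediate names `unlicensed` and `site_diff` are made up for this illustration, and the Country column is omitted for brevity.

```r
library(data.table)

data_sale <- data.table(
  `Site Id` = c("1000375476", "1000375476", "1000706152",
                "1000706152", "1000797366", "1000797366"),
  `Product ID` = c("UG10000-WISD", "UGD12895", "UG10000-WISD",
                   "UG80000-NTCD-G", "UG10000-WISD", "UG12210")
)
data_licenses <- data.table(
  `Site Id` = c("1000375476", "1000375476", "1000797366", "1000797366"),
  `Product ID` = c("UG10000-WISD", "UGD12895", "UG12785", "UG12210")
)

# Anti-join: sale rows whose (Site Id, Product ID) pair is absent from licenses
unlicensed <- unique(data_sale[!data_licenses, on = c("Site Id", "Product ID")])

# Collapse the unlicensed products to one string per site
site_diff <- unlicensed[, .(difference = paste(`Product ID`, collapse = ",")),
                        by = "Site Id"]

# Update join: add the per-site string to every row of data_sale by reference;
# sites with nothing unlicensed keep NA
data_sale[site_diff, difference := i.difference, on = "Site Id"]
print(data_sale)
```

This avoids building and splitting comma-separated strings altogether, since the comparison is done on the original long-format rows.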
How about this quick way to get the difference for each site? You can then merge the result back into any frame that has a Site Id column:
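library(magrittr) # provides %>%
# Store each site's products as a plain character vector (a single .() wrapper),
# so that setdiff() compares individual product IDs rather than whole vectors.
data_licenses[, .(licen_p = .(unique(`Product ID`))), by = `Site Id`] %>%
.[data_sale[, .(sale_p = .(unique(`Product ID`))), by = `Site Id`], on = .(`Site Id`)] %>%
.[, .(difference = toString(setdiff(sale_p[[1]], licen_p[[1]]))), by = `Site Id`]
Output:
Site Id difference
1: 1000375476
2: 1000706152 UG10000-WISD, UG80000-NTCD-G
3: 1000797366 UG10000-WISD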
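Merging a per-site result back onto the row-level data could look roughly like the base R sketch below. It is an illustration rather than the answerer's code: the column names `site` and `product` are simplified stand-ins for Site Id and Product ID, and `per_site` is a hypothetical name for the one-row-per-site result.

```r
sale <- data.frame(
  site = c("1000375476", "1000375476", "1000706152",
           "1000706152", "1000797366", "1000797366"),
  product = c("UG10000-WISD", "UGD12895", "UG10000-WISD",
              "UG80000-NTCD-G", "UG10000-WISD", "UG12210"),
  stringsAsFactors = FALSE
)
lic <- data.frame(
  site = c("1000375476", "1000375476", "1000797366", "1000797366"),
  product = c("UG10000-WISD", "UGD12895", "UG12785", "UG12210"),
  stringsAsFactors = FALSE
)

# One character vector of products per site
sale_by_site <- split(sale$product, sale$site)
lic_by_site  <- split(lic$product, lic$site)

# Per-site difference; a site missing from lic_by_site yields NULL,
# so setdiff() returns all of that site's sale products
per_site <- data.frame(
  site = names(sale_by_site),
  difference = vapply(
    names(sale_by_site),
    function(s) toString(setdiff(sale_by_site[[s]], lic_by_site[[s]])),
    character(1), USE.NAMES = FALSE
  ),
  stringsAsFactors = FALSE
)

# Left merge keeps every sale row and repeats the site-level value
merged <- merge(sale, per_site, by = "site", all.x = TRUE)
print(merged)
```

Here a site with no unlicensed products gets an empty string, since `toString()` of an empty vector is `""`; replace it with `NA` afterwards if that matches the desired output better.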