Error extracting product items for Market Basket Analysis using regular expressions in R (stringr)

order_id product_name
1 The Ordinary - High-Adherence Silicone Primer - 30ml, The Ordinary - Natural Moisturizing Factors + HA 30ml
2 Sandal, Brown - 44
3 Acetate - Square - Black - Transition - Sunglasses, Cartier - 8221 - Rim less - Green Double Shade - Sunglasses, Ray Ban - Aviator - Brown Double Shade - 3026 - Diamond Hard - Unbreakable lens, Burberry - 2A357 - Havana - Aviator - Sunglasses, Acetate - Square - Black - Transition - Sunglasses, Cartier - 8221 - Rim less - Green Double Shade - Sunglasses, Ray Ban - Aviator - Brown Double Shade - 3026 - Diamond Hard - Unbreakable lens, Burberry - 2A357 - Havana - Aviator - Sunglasses
4 NasGas Instant Geyser DG6L, NasGas Instant Geyser DG6L, NasGas Instant Geyser DG6L
5 Mpow Flame Solo Bluetooth Earbuds, Punchy Bass IPX7 Waterproof In Ear Wireless Earphones Bluetooth Headphones, USB-CFast ChargingBT5.028H Playtime Built-in Mic for Running Workout, Mpow Flame Solo Bluetooth Earbuds, Punchy Bass IPX7 Waterproof In Ear Wireless Earphones Bluetooth Headphones, USB-CFast ChargingBT5.028H Playtime Built-in Mic for Running Workout

The above is a sample of the product items in the dataset. This is how the items are stored in the database.

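Consider order ID 3: the first item is Acetate ..., the second is Cartier ..., the third is Ray Ban ..., and the fourth is Burberry .... After that, the same items are simply repeated, so each appears twice; in some orders (order ID 4) an item appears three times. I need to remove this duplication. In this case the delimiter is a comma.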

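Second:

Consider order ID 5: here I cannot split the items on commas, because the first product item only ends at "Workout" and the description of that single product item contains commas.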

I was previously using the code below:

library(dplyr)
library(stringr)
data.frame(tran_pay4) %>%
  mutate(product_name = str_extract_all(product_name, "((?!\\s)[^,]+)(?!.*\\1)"))

This solved the problem for most carts, but it does not work for the case order_id = 5. The objective is to keep each product item only once.

The output should look like this:

order_id product_name
1 The Ordinary - High-Adherence Silicone Primer - 30ml, The Ordinary - Natural Moisturizing Factors + HA 30ml
2 Sandal, Brown - 44
3 Acetate - Square - Black - Transition - Sunglasses, Cartier - 8221 - Rim less - Green Double Shade - Sunglasses, Ray Ban - Aviator - Brown Double Shade - 3026 - Diamond Hard - Unbreakable lens, Burberry - 2A357 - Havana - Aviator - Sunglasses
4 NasGas Instant Geyser DG6L
5 Mpow Flame Solo Bluetooth Earbuds Punchy Bass IPX7 Waterproof In Ear Wireless Earphones Bluetooth Headphones USB-CFast ChargingBT5.028H Playtime Built-in Mic for Running Workout

Could you please tell me how to do this?

You don't need a regular expression. You can simply use strsplit and unique to find the unique items.

tran_pay4$newproduct = sapply(strsplit(tran_pay4$product_name, ", "), 
                              function(x) paste(unique(x), collapse = ", "))
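
To see what this does on rows like the ones in the question, here is a small self-contained sketch. The data frame below is a hypothetical, shortened stand-in for tran_pay4 (the product strings are abbreviated), and dedupe_items is just an illustrative helper name, not part of any package. It uses the same strsplit/unique idea, with fixed = TRUE so the separator is matched literally and vapply so the result is always a character vector.

```r
# Hypothetical, shortened sample modelled on the rows in the question
tran_pay4 <- data.frame(
  order_id = c(2, 4, 5),
  product_name = c(
    "Sandal, Brown - 44",
    "NasGas Instant Geyser DG6L, NasGas Instant Geyser DG6L, NasGas Instant Geyser DG6L",
    paste(
      "Mpow Flame Solo Bluetooth Earbuds, Punchy Bass, Built-in Mic for Running Workout",
      "Mpow Flame Solo Bluetooth Earbuds, Punchy Bass, Built-in Mic for Running Workout",
      sep = ", "
    )
  ),
  stringsAsFactors = FALSE
)

# Illustrative helper: split on ", ", drop repeated pieces, glue the rest back
dedupe_items <- function(x) {
  pieces <- strsplit(x, ", ", fixed = TRUE)
  vapply(pieces, function(p) paste(unique(p), collapse = ", "), character(1))
}

tran_pay4$newproduct <- dedupe_items(tran_pay4$product_name)
print(tran_pay4$newproduct)
```

Note that this treats every comma-separated piece as the unit of comparison. For an item whose description itself contains commas (order 5), the repeated pieces are dropped, but the commas inside the description are kept rather than removed.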