Error extracting product items for Market Basket Analysis using regular expressions in R (stringr)
order_id | product_name |
---|---|
1 | The Ordinary - High-Adherence Silicone Primer - 30ml, The Ordinary - Natural Moisturizing Factors + HA 30ml |
2 | Sandal, Brown - 44 |
3 | Acetate - Square - Black - Transition - Sunglasses, Cartier - 8221 - Rim less - Green Double Shade - Sunglasses, Ray Ban - Aviator - Brown Double Shade - 3026 - Diamond Hard - Unbreakable lens, Burberry - 2A357 - Havana - Aviator - Sunglasses, Acetate - Square - Black - Transition - Sunglasses, Cartier - 8221 - Rim less - Green Double Shade - Sunglasses, Ray Ban - Aviator - Brown Double Shade - 3026 - Diamond Hard - Unbreakable lens, Burberry - 2A357 - Havana - Aviator - Sunglasses |
4 | NasGas Instant Geyser DG6L, NasGas Instant Geyser DG6L, NasGas Instant Geyser DG6L |
5 | Mpow Flame Solo Bluetooth Earbuds, Punchy Bass IPX7 Waterproof In Ear Wireless Earphones Bluetooth Headphones, USB-CFast ChargingBT5.028H Playtime Built-in Mic for Running Workout, Mpow Flame Solo Bluetooth Earbuds, Punchy Bass IPX7 Waterproof In Ear Wireless Earphones Bluetooth Headphones, USB-CFast ChargingBT5.028H Playtime Built-in Mic for Running Workout |
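The above is a sample of the product items in the dataset. This is how the items are stored in the database.
Consider order ID 3:
The first item is Acetate ..., the second is Cartier ..., followed by Ray Ban ... and Burberry ... After that, the same items are simply repeated a second time, and in some orders (order ID 4) an item appears three times. I need to remove this repetition. In this case the separator is a comma.
Secondly:
Consider order ID 5: here I cannot split the items on commas, because the first product item only ends at "Workout" and there are commas inside that single product description.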
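To make the repetition concrete, here is a minimal base R sketch that counts how often each comma-separated piece occurs. The string `x` is a hypothetical, shortened stand-in for an order-3-style row, and `repeat_factor` is a name made up for this illustration.

```r
# Hypothetical shortened order-3-style string: two products, listed twice.
x <- "Acetate - Sunglasses, Cartier - Sunglasses, Acetate - Sunglasses, Cartier - Sunglasses"

# Split on the ", " separator and count how often each piece occurs.
pieces <- strsplit(x, ", ", fixed = TRUE)[[1]]
piece_counts <- table(pieces)
print(piece_counts)

# If every piece occurs the same number of times, the whole list was
# probably duplicated that many times (an assumption, not a guarantee).
n <- as.integer(piece_counts)
repeat_factor <- if (all(n == n[1])) n[1] else NA_integer_
print(repeat_factor)
```

Note that a "piece" here is only whatever sits between two commas. For a row like order ID 5, a piece is a fragment of one description rather than a whole product.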
I was previously using the code below:
library(dplyr)
library(stringr)
data.frame(tran_pay4) %>%
  mutate(product_name = str_extract_all(product_name, "((?!\\s)[^,]+)(?!.*\\1)"))
This solves the problem for most baskets, but it does not work for order_id = 5.
The objective is to keep each individual product item only once.
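As a rough illustration of what that pattern does, the sketch below runs it on two short, made-up strings. It uses base R's `regmatches()` and `gregexpr()` with `perl = TRUE` as a stand-in for `str_extract_all()`, which is an assumption made so that the example runs without extra packages; `extract_all` is a hypothetical helper name.

```r
# Same pattern as in the question, with backslashes doubled for an R string.
pattern <- "((?!\\s)[^,]+)(?!.*\\1)"

# Base R stand-in for stringr::str_extract_all (assumption for this sketch).
extract_all <- function(x) regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]

# A chunk is kept only if the same text does not occur again later in the
# string, so the last copy of each comma-free chunk survives.
repeated_items <- extract_all("Sandal, Boot, Sandal, Boot")
print(repeated_items)

# Hypothetical stand-in for order 5: ONE product whose description has a comma.
one_product <- extract_all("Earbuds, Waterproof, Earbuds, Waterproof")
print(one_product)
```

Both calls return two chunks. That is fine for the first string, but in the second case the single product comes back as two separate items, which appears to be what goes wrong for order_id = 5.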
The output should look like this:
order_id | product_name |
---|---|
1 | The Ordinary - High-Adherence Silicone Primer - 30ml, The Ordinary - Natural Moisturizing Factors + HA 30ml |
2 | Sandal, Brown - 44 |
3 | Acetate - Square - Black - Transition - Sunglasses, Cartier - 8221 - Rim less - Green Double Shade - Sunglasses, Ray Ban - Aviator - Brown Double Shade - 3026 - Diamond Hard - Unbreakable lens, Burberry - 2A357 - Havana - Aviator - Sunglasses |
4 | NasGas Instant Geyser DG6L |
5 | Mpow Flame Solo Bluetooth Earbuds Punchy Bass IPX7 Waterproof In Ear Wireless Earphones Bluetooth Headphones USB-CFast ChargingBT5.028H Playtime Built-in Mic for Running Workout |
How can I do this?
You don't need a regular expression. You can simply use strsplit and unique to find the unique items.
# Split each order on ", ", drop repeated pieces, then join the rest back together
tran_pay4$newproduct <- sapply(strsplit(tran_pay4$product_name, ", "),
                               function(x) paste(unique(x), collapse = ", "))
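For anyone who wants to try this without the original data, here is a small reproducible sketch. The data frame is a cut-down stand-in for `tran_pay4` built from the sample rows; the order 5 description is shortened for readability, so treat that row as hypothetical. The intermediate `items` list is an extra name introduced here, not part of the code above.

```r
# Small stand-in for tran_pay4 (order 5 description shortened, hypothetical).
tran_pay4 <- data.frame(
  order_id = c(2, 4, 5),
  product_name = c(
    "Sandal, Brown - 44",
    "NasGas Instant Geyser DG6L, NasGas Instant Geyser DG6L, NasGas Instant Geyser DG6L",
    "Mpow Flame Solo Bluetooth Earbuds, Punchy Bass, Mpow Flame Solo Bluetooth Earbuds, Punchy Bass"
  ),
  stringsAsFactors = FALSE
)

# One character vector of distinct pieces per order.
items <- lapply(strsplit(tran_pay4$product_name, ", ", fixed = TRUE), unique)

# Collapse back to a single string per order, as in the answer.
tran_pay4$newproduct <- vapply(items, paste, character(1), collapse = ", ")
print(tran_pay4[, c("order_id", "newproduct")])
```

Order 4 collapses to a single "NasGas Instant Geyser DG6L", and order 2 is left unchanged. For order 5 the repetition is removed, but the comma inside the description is kept, so the result is not identical to the comma-free string in the desired output. Keeping `items` as a list may also be convenient if the next step expects one vector of items per order.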