Stringr pattern matching: Is there a way to identify multiple product descriptions?

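In my dataset, the product descriptions appear in some rows as:

  1. Product A, Product A, Product A

in other rows as:

  1. Product A, Product B, Product A, Product B

and in some rows simply as:

  1. Product A

Originally, my dataset contained strings in the following formats:

  1. Product A, Product B, Product A, Product B, Product A, Product B

  1. Product A, Product A, Product A

Because I only want one instance of each product, I dealt with this using the following code: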

df$lengths <- str_length(df$items)

df$new_items <- str_sub(df$items, 1, df$lengths/3)

Is there a way to solve the problem above by modifying this code?

df <-
structure(list(Product_name = c("Samsung Galaxy A03s (4+64), Samsung Galaxy A03s (4+64)", 
"Samsung Galaxy A03s (3+32), Samsung Galaxy A03s (3+32), Samsung Galaxy A03s (3+32)", 
"Samsung A32 (6+128), Samsung A32 (6+128), Samsung A32 (6+128)", 
"samsung A02s (3+32), samsung A02s (3+32), samsung A02s (3+32), samsung A02s (3+32), samsung A02s(3+32)", 
"Xiaomi Redmi 10 (6+128), Xiaomi Redmi 10 (6+128)", "Redmi Note 10 Pro (6+128), Redmi Note 10 Pro (6+128), Redmi Note 10 Pro (6+128)"
)), class = "data.frame", row.names = c(NA, -6L))

EDIT:

If the comma-separated strings do not always contain the same elements, more complex solutions are in order:

Data:

Product_name = c("Samsung Galaxy A03s (4+64), Samsung Galaxy A03s (3+32)", "Samsung Galaxy A03s (3+32), Samsung Galaxy A03s (3+32), Samsung Galaxy A03s (4+64)", "Samsung A32 (6+128), Samsung A32 (6+128), Samsung A32 (6+128)", "samsung A02s (3+32), samsung A02s (3+32), samsung A02s (3+32), samsung A02s (3+32), samsung A02s (3+32)", "Xiaomi Redmi 10 (6+128), Xiaomi Redmi 10 (6+128)", "Redmi Note 10 Pro (6+128), Redmi Note 10 Pro (6+128), Redmi Note 10 Pro (6+128)")

Solution 1: a regex solution based on a negated character class, a negative lookahead, and a backreference; basically a one-liner:

library(dplyr)
library(stringr)
data.frame(Product_name) %>%
  mutate(Product_name = str_extract_all(Product_name, "((?!\\s)[^,]+)(?!.*\\1)"))
                                             Product_name
1  Samsung Galaxy A03s (4+64), Samsung Galaxy A03s (3+32)
2  Samsung Galaxy A03s (3+32), Samsung Galaxy A03s (4+64)
3                                     Samsung A32 (6+128)
4                                     samsung A02s (3+32)
5                                 Xiaomi Redmi 10 (6+128)
6                               Redmi Note 10 Pro (6+128)
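To make the pieces of the pattern in Solution 1 easier to follow, here is a minimal base R sketch of the same idea. It assumes PCRE matching through `perl = TRUE` instead of stringr, and the vector `x` is a shortened copy of the example data used only for illustration. Note that backslashes have to be doubled inside an R string.

```r
# Shortened copy of the example data (illustrative only)
x <- c("Samsung Galaxy A03s (4+64), Samsung Galaxy A03s (3+32)",
       "Samsung Galaxy A03s (3+32), Samsung Galaxy A03s (3+32), Samsung Galaxy A03s (4+64)",
       "Samsung A32 (6+128), Samsung A32 (6+128), Samsung A32 (6+128)")

# ((?!\s)[^,]+)  group 1: a run of non-comma characters that does not
#                start with whitespace (negated class + lookahead)
# (?!.*\1)       accept the run only if the same text does not occur
#                again later in the string (backreference in a lookahead)
pattern <- "((?!\\s)[^,]+)(?!.*\\1)"

# gregexpr() finds every match per string; regmatches() extracts them
matches <- regmatches(x, gregexpr(pattern, x, perl = TRUE))

# Collapse each set of matches back into one comma-separated string
deduped <- vapply(matches, toString, character(1))
print(deduped)
```

Because the lookahead checks for the captured text anywhere later in the string, this keeps the last occurrence of each element. One caveat of this approach: an element that also occurs as part of a later, longer element (for example `A` followed by `AB`) would be dropped, so it is safest when elements never contain one another.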

Solution 2: based on tidyr functions

library(tidyr)
library(dplyr)
data.frame(Product_name) %>%
  # create identifier:
  mutate(row = row_number()) %>%
  # separate rows into individual elements:
  separate_rows(Product_name, sep = ", ") %>%
  group_by(row) %>%
  # remove duplicated elements:
  filter(!duplicated(Product_name)) %>%
  # put distinct elements back into the same row:
  summarise(Product_name = toString(Product_name))
# A tibble: 6 x 2
    row Product_name                                          
  <int> <chr>                                                 
1     1 Samsung Galaxy A03s (4+64), Samsung Galaxy A03s (3+32)
2     2 Samsung Galaxy A03s (3+32), Samsung Galaxy A03s (4+64)
3     3 Samsung A32 (6+128)                                   
4     4 samsung A02s (3+32)                                   
5     5 Xiaomi Redmi 10 (6+128)                               
6     6 Redmi Note 10 Pro (6+128)
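The same split, de-duplicate, and re-join steps can also be sketched in base R without reshaping the data frame. This is only an illustrative alternative, not part of the tidyr solution above; the vector `x` is a shortened copy of the example data.

```r
# Shortened copy of the example data (illustrative only)
x <- c("Samsung Galaxy A03s (4+64), Samsung Galaxy A03s (3+32)",
       "Samsung Galaxy A03s (3+32), Samsung Galaxy A03s (3+32), Samsung Galaxy A03s (4+64)",
       "Xiaomi Redmi 10 (6+128), Xiaomi Redmi 10 (6+128)")

# Split each string on a comma followed by optional whitespace
parts <- strsplit(x, ",\\s*")

# Keep the first occurrence of each element and join them back together
deduped <- vapply(
  parts,
  function(p) toString(unique(trimws(p))),
  character(1)
)
print(deduped)
```

Since `unique()` compares whole elements, this variant is not affected by one product name being a substring of another.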

Before the edit:

This solution is based on the assumption that the comma-separated elements within each string are always identical:

library(stringr)
str_extract(Product_name, "[^,]+")
[1] "Samsung Galaxy A03s (4+64)" "Samsung Galaxy A03s (3+32)"
[3] "Samsung A32 (6+128)"        "samsung A02s (3+32)"       
[5] "Xiaomi Redmi 10 (6+128)"    "Redmi Note 10 Pro (6+128)"

Data:

Product_name = c("Samsung Galaxy A03s (4+64), Samsung Galaxy A03s (4+64)", "Samsung Galaxy A03s (3+32), Samsung Galaxy A03s (3+32), Samsung Galaxy A03s (3+32)", "Samsung A32 (6+128), Samsung A32 (6+128), Samsung A32 (6+128)", "samsung A02s (3+32), samsung A02s (3+32), samsung A02s (3+32), samsung A02s (3+32), samsung A02s (3+32)", "Xiaomi Redmi 10 (6+128), Xiaomi Redmi 10 (6+128)", "Redmi Note 10 Pro (6+128), Redmi Note 10 Pro (6+128), Redmi Note 10 Pro (6+128)")
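Since taking only the first element relies on all elements in a string being identical, it may be worth checking that assumption before discarding the rest. The following base R sketch shows one way to do so; the vector `x` and the names `all_same` and `first_item` are hypothetical and used only for illustration.

```r
# Two illustrative strings: the first repeats one product, the second does not
x <- c("Xiaomi Redmi 10 (6+128), Xiaomi Redmi 10 (6+128)",
       "Samsung Galaxy A03s (4+64), Samsung Galaxy A03s (3+32)")

# Split on commas and strip surrounding whitespace from each element
parts <- lapply(strsplit(x, ","), trimws)

# TRUE where every element of the string is the same product
all_same <- vapply(parts, function(p) length(unique(p)) == 1L, logical(1))

# Everything before the first comma, equivalent to the "[^,]+" extraction
first_item <- sub(",.*$", "", x)

print(all_same)
print(first_item)
```

Rows where `all_same` is `FALSE` would lose information under the first-element approach and are better handled by one of the de-duplicating solutions.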