Regex to find every two commas, to separate rows from a column using dplyr

Question

I have a string like this:

71,72,80,81,102,100

I want to split it after every two "numbers", so:

71,72
80,81
102,100

I wrote this regex:

(([0-9]{1,4}),([0-9]{1,4}))

which highlights the groups I need, but not the commas between them.

In my code I am using dplyr.

Example:

df_example <- tibble(Lotes= "LOT1,LOT2,LOT3",NoModuloPlastico = "71,72,80,81,102,100")

df_result_example <- df_example %>%
separate_rows(c(Lotes),sep=",") %>%
separate_rows(c(NoModuloPlastico),sep="(([0-9]{1,3}),([0-9]{1,3}))")

So what I really need is a regex that matches only every second comma, but I can't find a way to write it.

I couldn't adapt these links to my needs:

https://bedigit.com/blog/regex-how-to-match-everything-except-a-particular-pattern/

https://blog.codinghorror.com/excluding-matches-with-regular-expressions/

What I get:

Lotes	NoModuloPlastico
LOT1	""
LOT1	","
LOT1	","
LOT1	""
LOT2	""
LOT2	","
LOT2	","
LOT2	""
LOT3	""
LOT3	","
LOT3	","
LOT3	""

What I want:

Lotes	NoModuloPlastico
LOT1	71,72
LOT2	80,81
LOT3	102,100

Answer 1

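You can do this:

library(dplyr)
library(stringr)
library(tidyr)

df_example %>%
  mutate(Lotes = str_split(Lotes, ','),
         NoModuloPlastico = NoModuloPlastico %>%
           # turn every second comma into ':' (note the doubled backslash in '\\1')
           str_replace_all('([^,]+,[^,]+),', '\\1:') %>%
           str_split(':')) %>%
  unnest(everything())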

# A tibble: 3 x 2
  Lotes NoModuloPlastico
  <chr> <chr>           
1 LOT1  71,72           
2 LOT2  80,81           
3 LOT3  102,100

Answer 2

I'm not sure whether this generalizes to your actual problem; if it doesn't, try a variation of this approach.

# Your tibble
df_example  = dplyr::tibble(Lotes= "LOT1,LOT2,LOT3",NoModuloPlastico = "71,72,80,81,102,100")

# with strsplit separate the strings by "," and turn them into a matrix
A = matrix(strsplit(df_example$Lotes, split = ",")[[1]], ncol = 1)
# byrow = TRUE keeps each consecutive pair (71,72 / 80,81 / 102,100) on one row
B = matrix(strsplit(df_example$NoModuloPlastico, split = ",")[[1]], ncol = 2, byrow = TRUE)

# cbind the two matrices, turn that into a dataframe and give names to the columns
C = as.data.frame(cbind(A,B))
colnames(C) = c("Lotes", "Modulo1", "Modulo2")

# Create your new column with paste0() function
C$NoModuloPlastico = paste0(C$Modulo1, ",",C$Modulo2)

# Optional: keep only the two columns from your example, under their original names
df_result_example = C[, c("Lotes", "NoModuloPlastico")]

This solves your example in base R.
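One detail worth checking in the matrix approach is the fill order: matrix() fills column by column unless byrow = TRUE is given. A minimal sketch of the difference, using a hypothetical vector v that stands in for the split values:

```r
# Hypothetical vector holding the six values after splitting on ","
v <- c("71", "72", "80", "81", "102", "100")

# Default fill is column-wise: consecutive values end up in different rows
by_col <- matrix(v, ncol = 2)

# byrow = TRUE keeps each consecutive pair together on one row
by_row <- matrix(v, ncol = 2, byrow = TRUE)

pairs_by_col <- paste(by_col[, 1], by_col[, 2], sep = ",")
pairs_by_row <- paste(by_row[, 1], by_row[, 2], sep = ",")

print(pairs_by_col)  # "71,81"  "72,102" "80,100"
print(pairs_by_row)  # "71,72"  "80,81"  "102,100"
```

Only the row-wise fill reproduces the pairs requested in the question.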

Answer 3

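You can use a slightly shorter version:

library(dplyr)
library(tidyr)

df_example %>% 
  mutate(Lotes = strsplit(Lotes, ','),
    NoModuloPlastico = NoModuloPlastico %>% 
      # '\\K' (doubled backslash in R) drops the text matched so far
      strsplit('[^,]*,[^,]*\\K,', perl=TRUE)) %>% 
  unnest(everything())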

Output:

# A tibble: 3 x 2
  Lotes NoModuloPlastico
  <chr> <chr>           
1 LOT1  71,72           
2 LOT2  80,81           
3 LOT3  102,100

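Notes:

strsplit(Lotes, ',') splits the Lotes column on commas.

strsplit('[^,]*,[^,]*\\K,', perl=TRUE) splits the NoModuloPlastico column on every second comma. [^,]*,[^,]* matches zero or more non-comma characters, a comma, and zero or more non-comma characters; \K discards the text matched so far from the match; then , matches the single comma that is used as the split point.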
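To see the effect of \K in isolation, here is a minimal sketch on a bare string rather than a tibble column. The variable names x and parts are only for illustration.

```r
x <- "71,72,80,81,102,100"

# Everything before \K is consumed but dropped from the reported match,
# so only the second comma of each pair acts as the separator.
parts <- strsplit(x, "[^,]*,[^,]*\\K,", perl = TRUE)[[1]]

print(parts)  # "71,72"   "80,81"   "102,100"
```

perl = TRUE is required here because \K is a PCRE feature and is not available in R's default regex engine.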

Answer 4

You can replace every second comma with a semicolon (or any other delimiter), change the commas in Lotes to semicolons as well, and then use separate_rows.

library(dplyr)
library(tidyr)

df_example %>%
  # '\\1' (doubled backslash in R) refers back to the captured ",<value>"
  mutate(NoModuloPlastico = gsub('(,.*?),', '\\1;', NoModuloPlastico), 
         Lotes = gsub(',', ';', Lotes, fixed = TRUE)) %>%
  separate_rows(Lotes, NoModuloPlastico, sep = ';')

#  Lotes NoModuloPlastico
#  <chr> <chr>           
#1 LOT1  71,72           
#2 LOT2  80,81           
#3 LOT3  102,100
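The replace-then-split idea can be checked step by step in base R. This is a small sketch on a plain string, with illustrative variable names (x, marked, pairs) that are not part of the answer:

```r
x <- "71,72,80,81,102,100"

# Step 1: the lazy group (,.*?) captures a comma plus the value after it;
# the comma that follows is replaced by ";".
marked <- gsub("(,.*?),", "\\1;", x)
print(marked)  # "71,72;80,81;102,100"

# Step 2: split on the new delimiter.
pairs <- strsplit(marked, ";", fixed = TRUE)[[1]]
print(pairs)   # "71,72"   "80,81"   "102,100"
```

Any delimiter works as long as it cannot occur in the data itself.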

Answer 5

You can also use the following solution:

library(dplyr)
library(tidyr)

df_example %>%
  # extract each "digits,digits" pair, then join the pairs with spaces
  mutate(NoModuloPlastico = paste0(regmatches(NoModuloPlastico, gregexpr("\\d+,\\d+", NoModuloPlastico))[[1]],
                                   collapse = " "), 
         Lotes = gsub(",", " ", Lotes)) %>%
  separate_rows(everything(), sep = "\\s+")

# A tibble: 3 x 2
  Lotes NoModuloPlastico
  <chr> <chr>           
1 LOT1  71,72           
2 LOT2  80,81           
3 LOT3  102,100
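Extracting pairs instead of splitting behaves differently when the number of values is odd. The sketch below uses a hypothetical second input, odd, to show this:

```r
x   <- "71,72,80,81,102,100"
odd <- "71,72,80"  # hypothetical input with an unpaired trailing value

# Pull out every non-overlapping "digits,digits" match
pairs     <- regmatches(x,   gregexpr("\\d+,\\d+", x))[[1]]
pairs_odd <- regmatches(odd, gregexpr("\\d+,\\d+", odd))[[1]]

print(pairs)      # "71,72"   "80,81"   "102,100"
print(pairs_odd)  # "71,72"  (the unpaired "80" is dropped)
```

If unpaired values have to be preserved, one of the split-based approaches may be the safer choice.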
