Regex to match every second comma, to separate rows from a column using dplyr

I have a string like this:

71,72,80,81,102,100

I want to split it after every two "numbers", so:

71,72
80,81
102,100

I wrote this regular expression:

(([0-9]{1,4}),([0-9]{1,4}))

It highlights the groups I need, but not the comma "," between those groups.

In my code I use dplyr.

Example:

library(dplyr)
library(tidyr)

df_example <- tibble(Lotes = "LOT1,LOT2,LOT3", NoModuloPlastico = "71,72,80,81,102,100")

df_result_example <- df_example %>%
  separate_rows(c(Lotes), sep = ",") %>%
  separate_rows(c(NoModuloPlastico), sep = "(([0-9]{1,3}),([0-9]{1,3}))")

So what I really need is a regex that matches every second comma, but I can't find a way to write it.

I couldn't adapt these links to my needs:

https://bedigit.com/blog/regex-how-to-match-everything-except-a-particular-pattern/

https://blog.codinghorror.com/excluding-matches-with-regular-expressions/

What I get:

Lotes NoModuloPlastico
LOT1 ""
LOT1 ","
LOT1 ","
LOT1 ""
LOT2 ""
LOT2 ","
LOT2 ","
LOT2 ""
LOT3 ""
LOT3 ","
LOT3 ","
LOT3 ""

What I want:

Lotes NoModuloPlastico
LOT1 71,72
LOT2 80,81
LOT3 102,100

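You can do it like this:

library(dplyr)
library(tidyr)
library(stringr)

df_example %>%
  mutate(Lotes = str_split(Lotes, ','),
         NoModuloPlastico = NoModuloPlastico %>%
           # turn every second comma into ':' and split on that
           str_replace_all('([^,]+,[^,]+),', '\\1:') %>%
           str_split(':')) %>%
  unnest(everything())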

# A tibble: 3 x 2
  Lotes NoModuloPlastico
  <chr> <chr>           
1 LOT1  71,72           
2 LOT2  80,81           
3 LOT3  102,100

I don't know whether this generalizes to your real problem; if it doesn't, try to find a variation of this approach.

# Your tibble
df_example = dplyr::tibble(Lotes = "LOT1,LOT2,LOT3", NoModuloPlastico = "71,72,80,81,102,100")

# Split both strings on "," and turn the pieces into matrices.
# byrow = TRUE keeps consecutive values in the same row (71 with 72, and so on).
A = matrix(strsplit(df_example$Lotes, split = ",")[[1]], ncol = 1)
B = matrix(strsplit(df_example$NoModuloPlastico, split = ",")[[1]], ncol = 2, byrow = TRUE)

# cbind the two matrices, turn the result into a data frame and name the columns
C = as.data.frame(cbind(A, B))
colnames(C) = c("Lotes", "Modulo1", "Modulo2")

# Create the new column with paste0()
C$NoModuloPlastico = paste0(C$Modulo1, ",", C$Modulo2)

# Extra: keep only the two columns, under your variable names
df_result_example = data.frame(Lotes = C$Lotes, NoModuloPlastico = C$NoModuloPlastico)

This solves your example in base R.
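
The pairing in the matrix approach depends on the fill order of matrix(). A minimal sketch, assuming the sample values from the question (vals, by_col and by_row are illustrative names only), compares the default column-wise fill with byrow = TRUE:

```r
# Split the sample string into its six values
vals <- strsplit("71,72,80,81,102,100", ",")[[1]]

# Default fill is column by column, which pairs values that are three apart
by_col <- matrix(vals, ncol = 2)
pairs_by_col <- paste(by_col[, 1], by_col[, 2], sep = ",")

# byrow = TRUE fills row by row, which pairs neighbouring values
by_row <- matrix(vals, ncol = 2, byrow = TRUE)
pairs_by_row <- paste(by_row[, 1], by_row[, 2], sep = ",")

print(pairs_by_col)  # "71,81"  "72,102" "80,100"
print(pairs_by_row)  # "71,72"  "80,81"  "102,100"
```

Only the row-wise fill reproduces the pairs asked for in the question.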

You can use a slightly shorter version:

library(dplyr)
library(tidyr)

df_example %>% 
  mutate(Lotes = strsplit(Lotes, ','),
    NoModuloPlastico = NoModuloPlastico %>% 
      strsplit('[^,]*,[^,]*\\K,', perl=TRUE)) %>% 
  unnest(everything())

Output:

# A tibble: 3 x 2
  Lotes NoModuloPlastico
  <chr> <chr>           
1 LOT1  71,72           
2 LOT2  80,81           
3 LOT3  102,100 

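Notes:

  • strsplit(Lotes, ',') splits the Lotes column on commas.
  • strsplit('[^,]*,[^,]*\\K,', perl=TRUE) splits the NoModuloPlastico column on every second comma. [^,]*,[^,]* matches zero or more non-comma characters, a comma, and zero or more non-comma characters; \K drops the text matched so far from the match; the final , then matches the comma on which the string is split.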
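
The effect of \K can be checked on the bare string with base R alone. This is a minimal sketch, assuming the sample values from the question; x and pairs are illustrative names only:

```r
x <- "71,72,80,81,102,100"

# perl = TRUE is required because \K is a PCRE feature.
# Everything before \K is dropped from the match, so only the
# second comma of each pair is consumed as the separator.
pairs <- strsplit(x, "[^,]*,[^,]*\\K,", perl = TRUE)[[1]]

print(pairs)  # "71,72"  "80,81"  "102,100"
```

Inside an R string literal the backslash must be doubled ("\\K"), even though the regex itself contains a single \K.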

You can replace every second comma in NoModuloPlastico with a semicolon (or any other delimiter), change the commas in Lotes to semicolons as well, and then use separate_rows.

library(dplyr)
library(tidyr)

df_example %>%
  mutate(NoModuloPlastico = gsub('(,.*?),', '\\1;', NoModuloPlastico), 
         Lotes = gsub(',', ';', Lotes, fixed = TRUE)) %>%
  separate_rows(Lotes, NoModuloPlastico, sep = ';')

#  Lotes NoModuloPlastico
#  <chr> <chr>           
#1 LOT1  71,72           
#2 LOT2  80,81           
#3 LOT3  102,100         
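
To see what the replacement step produces before separate_rows runs, here is a minimal base R sketch on the bare string. It assumes the sample values from the question and that ";" never occurs in the data; x, marked and pieces are illustrative names only:

```r
x <- "71,72,80,81,102,100"

# (,.*?), lazily matches from one comma up to the next one;
# the replacement keeps the captured part and swaps the closing comma for ";"
marked <- gsub("(,.*?),", "\\1;", x)
print(marked)  # "71,72;80,81;102,100"

# Splitting on the new delimiter yields the pairs
pieces <- strsplit(marked, ";", fixed = TRUE)[[1]]
print(pieces)  # "71,72"  "80,81"  "102,100"
```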

You can also use the following solution:

library(dplyr)
library(tidyr)

df_example %>%
  mutate(NoModuloPlastico = paste0(regmatches(NoModuloPlastico, gregexpr("\\d+,\\d+", NoModuloPlastico))[[1]],
                                   collapse = " "), 
         Lotes = gsub(",", " ", Lotes)) %>%
  separate_rows(everything(), sep = "\\s+")

# A tibble: 3 x 2
  Lotes NoModuloPlastico
  <chr> <chr>           
1 LOT1  71,72           
2 LOT2  80,81           
3 LOT3  102,100
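
The extraction step of this last approach can be tried on its own. This is a minimal sketch, assuming the sample values from the question (x and found are illustrative names only):

```r
x <- "71,72,80,81,102,100"

# gregexpr locates every non-overlapping "digits,digits" match;
# regmatches then extracts the matched substrings
found <- regmatches(x, gregexpr("\\d+,\\d+", x))[[1]]

print(found)  # "71,72"  "80,81"  "102,100"
```

Unlike the splitting approaches, this keeps only what the pattern matches, so a trailing value without a partner would be dropped rather than returned as its own piece.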