Regex to find every second comma, to separate a column into rows using dplyr
I have a string like this:
71,72,80,81,102,100
I want to split it after every two "numbers", so:
71,72
80,81
102,100
I wrote this regex:
(([0-9]{1,4}),([0-9]{1,4}))
It highlights the groups I need, but not the comma between one group and the next.
In my code I use dplyr.
Example:
library(dplyr)
library(tidyr)

df_example <- tibble(Lotes = "LOT1,LOT2,LOT3", NoModuloPlastico = "71,72,80,81,102,100")
df_result_example <- df_example %>%
  separate_rows(c(Lotes), sep = ",") %>%
  separate_rows(c(NoModuloPlastico), sep = "(([0-9]{1,3}),([0-9]{1,3}))")
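This means that what I really need is a regex that matches every second comma, but I can't find a way to write it.
I could not adapt these links to my needs: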
https://bedigit.com/blog/regex-how-to-match-everything-except-a-particular-pattern/
https://blog.codinghorror.com/excluding-matches-with-regular-expressions/
What I get:
Lotes | NoModuloPlastico |
---|---|
LOT1 | "" |
LOT1 | "," |
LOT1 | "," |
LOT1 | "" |
LOT2 | "" |
LOT2 | "," |
LOT2 | "," |
LOT2 | "" |
LOT3 | "" |
LOT3 | "," |
LOT3 | "," |
LOT3 | "" |
What I want:
Lotes | NoModuloPlastico |
---|---|
LOT1 | 71,72 |
LOT2 | 80,81 |
LOT3 | 102,100 |
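You can do it like this:
library(dplyr)
library(tidyr)
library(stringr)

df_example %>%
  mutate(Lotes = str_split(Lotes, ','),
         NoModuloPlastico = NoModuloPlastico %>%
           # turn every second comma into ":" and split on that
           str_replace_all('([^,]+,[^,]+),', '\\1:') %>%
           str_split(':')) %>%
  unnest(everything())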
# A tibble: 3 x 2
Lotes NoModuloPlastico
<chr> <chr>
1 LOT1 71,72
2 LOT2 80,81
3 LOT3 102,100
I don't know whether this generalizes to your actual problem; if not, try to find a variant of this approach.
# Your tibble
df_example = dplyr::tibble(Lotes = "LOT1,LOT2,LOT3", NoModuloPlastico = "71,72,80,81,102,100")
# Split both strings on "," with strsplit and turn the pieces into matrices
A = matrix(strsplit(df_example$Lotes, split = ",")[[1]], ncol = 1)
# byrow = TRUE keeps consecutive values in the same row: (71,72), (80,81), (102,100);
# the default column-wise fill would pair 71 with 81, 72 with 102, and 80 with 100
B = matrix(strsplit(df_example$NoModuloPlastico, split = ",")[[1]], ncol = 2, byrow = TRUE)
# cbind the two matrices, turn the result into a data frame, and name the columns
C = as.data.frame(cbind(A, B))
colnames(C) = c("Lotes", "Modulo1", "Modulo2")
# Create the new column with paste0()
C$NoModuloPlastico = paste0(C$Modulo1, ",", C$Modulo2)
# Optional: keep only the two columns that match your variable names
df_result_example = data.frame(Lotes = C$Lotes, NoModuloPlastico = C$NoModuloPlastico)
This solves your example in base R.
You can use a slightly shorter version:
library(dplyr)
library(tidyr)

df_example %>%
  mutate(Lotes = strsplit(Lotes, ','),
         NoModuloPlastico = NoModuloPlastico %>%
           strsplit('[^,]*,[^,]*\\K,', perl = TRUE)) %>%
  unnest(everything())
Output:
# A tibble: 3 x 2
Lotes NoModuloPlastico
<chr> <chr>
1 LOT1 71,72
2 LOT2 80,81
3 LOT3 102,100
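Notes:
strsplit(Lotes, ',') splits the Lotes column on commas.
strsplit('[^,]*,[^,]*\\K,', perl = TRUE) splits the NoModuloPlastico column on every second comma. In the pattern, [^,]*,[^,]* matches zero or more non-comma characters, a comma, and zero or more non-comma characters; \K then drops the text matched so far from the match; and the final , matches the comma that the string is split on.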
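To see what \K changes, here is a minimal base R sketch that applies the pattern to the example string with and without it. The variable names x, with_k, and without_k are illustrative and are not part of the answer above.

```r
x <- "71,72,80,81,102,100"

# With \K, everything matched before the second comma is dropped from the
# match, so only that comma acts as the delimiter.
with_k <- strsplit(x, "[^,]*,[^,]*\\K,", perl = TRUE)[[1]]

# Without \K, the whole "number,number," run is the delimiter and is consumed,
# so the first two pairs are lost from the result.
without_k <- strsplit(x, "[^,]*,[^,]*,", perl = TRUE)[[1]]

print(with_k)
print(without_k)
```

with_k should hold the three pairs, while without_k no longer contains "71,72" or "80,81". This is the same effect as in the question, where the pairs themselves were used as the separator.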
You can replace every second comma in NoModuloPlastico with a semicolon (or any other delimiter), change the commas in Lotes to semicolons as well, and then use separate_rows.
library(dplyr)
library(tidyr)

df_example %>%
  mutate(NoModuloPlastico = gsub('(,.*?),', '\\1;', NoModuloPlastico),
         Lotes = gsub(',', ';', Lotes, fixed = TRUE)) %>%
  separate_rows(Lotes, NoModuloPlastico, sep = ';')
# Lotes NoModuloPlastico
# <chr> <chr>
#1 LOT1 71,72
#2 LOT2 80,81
#3 LOT3 102,100
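The replacement step can be checked on its own in base R, without dplyr or tidyr. This is only a sketch on the example string; the names x, marked, and pairs are illustrative.

```r
x <- "71,72,80,81,102,100"

# "(,.*?)," lazily matches from one comma up to the next comma. Group 1 keeps
# the first comma and the value after it; the second comma becomes ";".
marked <- gsub("(,.*?),", "\\1;", x)

# Splitting on the literal ";" then yields one element per pair.
pairs <- strsplit(marked, ";", fixed = TRUE)[[1]]

print(marked)
print(pairs)
```

The delimiter has to be a character that never occurs in the data; ";" is safe for this example but is an assumption for other inputs.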
You can also use the following solution:
library(dplyr)
library(tidyr)

df_example %>%
  mutate(NoModuloPlastico = paste0(regmatches(NoModuloPlastico, gregexpr("\\d+,\\d+", NoModuloPlastico))[[1]],
                                   collapse = " "),
         Lotes = gsub(",", " ", Lotes)) %>%
  separate_rows(everything(), sep = "\\s+")
# A tibble: 3 x 2
Lotes NoModuloPlastico
<chr> <chr>
1 LOT1 71,72
2 LOT2 80,81
3 LOT3 102,100
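This approach extracts the pairs instead of splitting on a delimiter. A minimal base R sketch of just the extraction step follows; the names x, pairs, odd, and odd_pairs are illustrative, and the odd-length input is an assumed edge case that does not appear in the question.

```r
x <- "71,72,80,81,102,100"

# gregexpr() locates every "digits,digits" run; regmatches() pulls them out.
pairs <- regmatches(x, gregexpr("\\d+,\\d+", x))[[1]]

# Assumed edge case: with an odd number of values, the last value has no
# partner, never matches the pattern, and is dropped.
odd <- "71,72,80"
odd_pairs <- regmatches(odd, gregexpr("\\d+,\\d+", odd))[[1]]

print(pairs)
print(odd_pairs)
```

If an unpaired trailing value must be kept, one of the split-based answers above may be a better fit, since splitting keeps whatever remains after the last delimiter.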