sparklyr, dplyr, regex: extract a pattern from a text variable and separate the matches with semicolons
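I am using sparklyr and dplyr, and I have been trying to create a variable, extract_code, that extracts a specific pattern from a text variable.
The pattern is 3 letters followed by 3 digits. It can occur several times in the same text.
In that case, I want the matches separated by semicolons.
I created this object with the regular expression: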
regex_pattern <- "[A-Za-z]{3}[0-9]{3}"
Here is what I have:
test <- data.table(id = 1:3, text= c("(table 012 APM325)", "(JUI524 toto KIO879)" , "(pink car in the field KJU547 MPO362/JHY879)"))
Here is what I want:
test <- data.table(id = 1:3, text= c("(table 012 APM325)", "(JUI524 toto KIO879)" , "(pink car in the field KJU547 MPO362/JHY879)"), extract_code =c( "APM325", "JUI524;KIO879" , "KJU547;MPO362;JHY879"))
I tried this:
test <- test %>% mutate(extract_code = regexp_extract(text, regex_pattern, 0))
data.table(id = 1:3, text= c("(table 012 APM325)", "(JUI524 toto KIO879)" , "(pink car in the field KJU547 MPO362/JHY879)"), extract_code =c( "APM325", "JUI524" , "KJU547"))
But I only get the first match.
Do you have any suggestions? Thanks!
Edit: this works!
try <- data.table(id = 1:3, text= c("(table 012 APM325)", "(JUI524 toto KIO879)" , "(pink car in the field KJU547 MPO362/JHY879)"))
sdf_try <- copy_to(sc, try , "try" )
extract.pattern <- function(pat) function(df) {
  f <- function(vec) sapply(regmatches(vec, gregexpr(pat, vec)), paste0, collapse = ";")
  dplyr::mutate(df, extract_code = f(text))
}
sdf_try %>%
  spark_apply(extract.pattern("[A-Z]{3}[0-9]{3}"))
But this does not work:
regex_pattern <- "[A-Z]{3}[0-9]{3}"
sdf_try %>%
spark_apply(extract.pattern(regex_pattern))
# Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 4 times, most recent failure: Lost task 0.3 in stage 8.0 Exception: sparklyr worker rscript failure with status 255, check worker logs for details.
sdf_try %>%
spark_apply(extract.pattern('regex_pattern'))
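One thing that may be worth trying for the failing call (an assumption, since the worker logs are not shown here): R evaluates function arguments lazily, so extract.pattern(regex_pattern) can return a closure that still refers to the variable regex_pattern instead of holding its value. Calling force(pat) inside the factory stores the string itself in the closure. The sketch below only demonstrates that effect locally on a plain data.frame. The names extract_pattern and local_df are hypothetical, and base R is used instead of dplyr::mutate so that the function needs no extra package.

```r
# Function factory that captures the pattern eagerly.
extract_pattern <- function(pat) {
  force(pat)  # evaluate the argument now so the closure holds the string itself
  function(df) {
    # Find every match in each row, then collapse the matches with ";".
    matches <- regmatches(df$text, gregexpr(pat, df$text))
    df$extract_code <- vapply(matches, paste0, character(1), collapse = ";")
    df
  }
}

regex_pattern <- "[A-Z]{3}[0-9]{3}"
f <- extract_pattern(regex_pattern)
rm(regex_pattern)  # the closure no longer depends on the global variable

local_df <- data.frame(
  id = 1:2,
  text = c("(table 012 APM325)", "(JUI524 toto KIO879)"),
  stringsAsFactors = FALSE
)
result <- f(local_df)
print(result)
```

If this behaves the same way on the cluster, f could be passed to spark_apply() in place of extract.pattern(regex_pattern). spark_apply() also has a context argument for passing extra objects to the function, which might be another route. Neither has been verified against this setup.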
regex_pattern <- "[A-Z]{3}[0-9]{3}"
test %>% mutate(extract_code = sapply(regmatches(text, gregexpr(regex_pattern,text)), paste0, collapse = ";"))
#  id                                         text         extract_code
#1  1                           (table 012 APM325)               APM325
#2  2                         (JUI524 toto KIO879)        JUI524;KIO879
#3  3 (pink car in the field KJU547 MPO362/JHY879) KJU547;MPO362;JHY879
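I have changed [A-Za-z] to [A-Z]. If that does not work for you, change it back. It does work for the example.
regmatches returns a list of matches. I then collapse them into a single string separated by ;.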
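The two steps described above can be looked at separately. This is a minimal local sketch in base R only, with a fourth, made-up input string added to show what happens when nothing matches. vapply is swapped in for sapply here so that the result is always a character vector.

```r
regex_pattern <- "[A-Z]{3}[0-9]{3}"
text <- c("(table 012 APM325)",
          "(JUI524 toto KIO879)",
          "(pink car in the field KJU547 MPO362/JHY879)",
          "(no code here)")  # hypothetical row without any match

# Step 1: gregexpr() locates every match in each string and
# regmatches() extracts them, giving a list with one character vector per input.
matches <- regmatches(text, gregexpr(regex_pattern, text))
print(lengths(matches))

# Step 2: collapse each vector into a single string separated by ";".
# A row without matches collapses to an empty string.
extract_code <- vapply(matches, paste0, character(1), collapse = ";")
print(extract_code)
```

Whether a row without a match should end up as an empty string or as NA depends on how the column is used later. The empty string is simply what paste0 returns for an empty vector.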