Using the unnest function in R to split Excel cells of a column by regex gives duplicated results
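I have a problem: I am trying to split the cells of an Excel file wherever an item starts with a number (the plus sign in the pattern allows for two-digit numbers) followed by a period ".", as defined by the regular expression in the function below. However, when the split actually happens, the output is duplicated (recycled) across the other columns.
Here you can find my input data, current output and this is the desired output.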
# Load libraries
library('tidyverse')
library('readxl')
library('openxlsx')
# Set functions
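do_split = function(x, pattern = "\\d+\\.\\s{1}"){
  # Accept a one-column tibble as well as a character vector
  if( is_tibble(x) ){ x = pull(x) }
  # Collect the numbered bullets ("1. ", "2. ", ...) so they can be re-attached
  num_bullets = x %>% str_extract_all("\\d+\\. ") %>% unlist
  # Split on the bullets, drop empty pieces, then put the bullets back in front
  x %>% str_split(pattern) %>% unlist %>% .[.!=""] %>% str_c(num_bullets,.) %>% list %>% return
}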
# Read data
df = read_excel(path = '~/Desktop/master.xlsx')
# Wrangle data
o = df %>%
  mutate(Result = Result %>% do_split, Steps = Steps %>% do_split) %>%
  unnest(Result, Steps)
# Output file
write.xlsx(x = o, file = "out.xlsx")
Use rowwise() so that your mutate command operates on one row at a time...
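df %>%
  rowwise() %>%
  # Apply do_split to each row's Result and Steps separately
  mutate(across(c(Result, Steps), do_split)) %>%
  ungroup() %>%
  # Expand both list-columns in parallel
  unnest(c(Result, Steps))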
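To see why processing one row at a time matters, here is a minimal sketch using only stringr. The two-element vector `steps` is hypothetical toy data standing in for one column of the Excel file, not the asker's actual input. Splitting the whole column at once and unlisting merges the pieces of every row into a single vector, which mutate then recycles onto each row; splitting each element separately keeps the pieces with the row they came from.

```r
library(stringr)

# Hypothetical stand-in for one column of the spreadsheet (two rows)
steps <- c("1. Open app 2. Log in", "1. Click save")

# Whole column at once: unlist() discards the row boundaries,
# so the pieces of both rows end up in one vector
pieces <- unlist(str_split(steps, "\\d+\\.\\s"))
pieces <- pieces[pieces != ""]
length(pieces)

# One element at a time: each row keeps only its own pieces
per_row <- lapply(steps, function(s) {
  p <- unlist(str_split(s, "\\d+\\.\\s"))
  p[p != ""]
})
lengths(per_row)
```

In the whole-column case the single combined vector is wrapped in a list of length one, so every row of the data frame receives all of the pieces, and unnesting then repeats them for each row.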
If you don't need to keep the delimiter (in your case the leading number, e.g. "1."), tidyr::separate_rows() may be easier/cleaner...
df %>%
  separate_rows(Result, Steps, sep = "\\d+\\. ") %>%
  filter(Result != "")
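As a rough illustration of the separate_rows() approach, the sketch below runs it on a small made-up tibble. The `ID` column and all cell values are assumptions for demonstration only, and the example relies on Result and Steps containing the same number of numbered items within each row.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: two rows with matching numbers of items per row
toy <- tibble(
  ID     = c(1, 2),
  Steps  = c("1. Open app 2. Log in", "1. Click save"),
  Result = c("1. App opens 2. User logged in", "1. File saved")
)

out <- toy %>%
  # Split both columns in parallel on the numbered bullets
  separate_rows(Result, Steps, sep = "\\d+\\. ") %>%
  # Each cell starts with a bullet, so the first piece is empty; drop it
  filter(Result != "")

nrow(out)
out$ID
```

The pieces may keep a trailing space where the next bullet followed, so a call such as stringr::str_trim() on the split columns could be added if exact text matters.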