Separate strings into rows unless between sets of delimiters

Question

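I have utterances with annotation symbols:

utt2 <- c("↑hey girls↑ can I <join yo:u>", "((v: grunts))", "!damn shit! got it", 
"I mean /yeah we saw each other at a party:/↓ the other day"
)

I need to split utt2 into separate words unless the words are enclosed by certain delimiters, including this class [(/≈↑£<>°!]. I'm doing reasonably well using a double negative lookahead for utterances where only one such string occurs between delimiters, but I'm failing to split correctly where there are multiple such strings between delimiters:

library(tidyr)
library(dplyr)
data.frame(utt2) %>%
  separate_rows(utt2, sep = "(?!.*[(/≈↑£<>°!].*)\\s(?!.*[)/≈↑£<>°!])")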
# A tibble: 9 × 1
  utt2                                        
  <chr>                                       
1 ↑hey girls↑ can I <join yo:u>               
2 ((v: grunts))                               
3 !damn shit!                                 
4 got                                         
5 it                                          
6 I mean /yeah we saw each other at a party:/↓
7 the                                         
8 other                                       
9 day

The expected result would be:

1 ↑hey girls↑ 
2 can
3 I
4 <join yo:u>               
5 ((v: grunts))                               
6 !damn shit!                                 
7 got                                         
8 it                                          
9 I
10 mean 
11 /yeah we saw each other at a party:/↓
12 the                                         
13 other                                       
14 day

Answer 1

You can use

data.frame(utt2) %>% separate_rows(utt2, sep = "(?:([/≈↓£°!↑]).*?\\1|\\([^()]*\\)|<[^<>]*>)(*SKIP)(*F)|\\s+")

See the regex demo.

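Note that in your case there are chars that come in pairs (like ( and ), < and >) and chars that do not (like ↑, £). They need different handling, and the pattern reflects that.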
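To make the distinction concrete, here is a minimal base R sketch that runs the two kinds of sub-pattern separately. The variable names (same_char, paired, x) and the sample string are made up for illustration only.

```r
# Same-character delimiters: the closing char must equal the captured opening char
same_char <- "([/≈↓£°!↑]).*?\\1"
# Paired delimiters: the opening and closing chars differ, so each pair is spelled out
paired <- "\\([^()]*\\)|<[^<>]*>"

# Hypothetical sample string containing one group of each kind
x <- "!damn shit! got <join yo:u>"

same_hits   <- regmatches(x, gregexpr(same_char, x, perl = TRUE))[[1]]
paired_hits <- regmatches(x, gregexpr(paired, x, perl = TRUE))[[1]]

print(same_hits)
print(paired_hits)
```

The backreference only works when the opening and closing delimiter are the same char, which is why ( ) and < > get their own alternatives.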

Details:

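(?:([/≈↓£°!↑]).*?\1|\([^()]*\)|<[^<>]*>)(*SKIP)(*F) (shown as a plain regex; in an R string literal each backslash must be doubled) matches
- ([/≈↓£°!↑]).*?\1| - a /, ≈, ↓, £, °, ! or ↑ char captured into Group 1, then any zero or more chars other than line break chars, as few as possible (see .*?), and then the same char as captured into Group 1, or
- \([^()]*\)| - a ( char, zero or more chars other than ( and ), and then a ) char, or
- <[^<>]*> - a < char, zero or more chars other than < and >, and then a > char
- (*SKIP)(*F) - skips the matched text and restarts the search from the position where the failure occurred
| - or
\s+ - one or more whitespaces in any other context.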
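If you want to check the pattern without loading any packages, the following sketch applies it with base R's strsplit() and perl = TRUE. This is swapped in here for separate_rows() purely for illustration, and the names pattern and tokens are my own. For the sample data it should give the 14 pieces listed in the expected result.

```r
# The pattern from above, with backslashes doubled for the R string literal
pattern <- "(?:([/≈↓£°!↑]).*?\\1|\\([^()]*\\)|<[^<>]*>)(*SKIP)(*F)|\\s+"

# Sample utterances from the question
utt2 <- c("↑hey girls↑ can I <join yo:u>", "((v: grunts))", "!damn shit! got it",
          "I mean /yeah we saw each other at a party:/↓ the other day")

# perl = TRUE selects the PCRE engine, which is needed for (*SKIP)(*F)
tokens <- unlist(strsplit(utt2, pattern, perl = TRUE))
print(tokens)
```

Delimited groups are consumed and discarded as split points by (*SKIP)(*F), so only whitespace outside those groups ends up splitting the string.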

regex

r

tidyr