How to split strings between two specific characters (R)

Question

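I am trying to split some scraped journal publication data neatly into columns (author, title, journal, and so on). I have most of it working, but I am stuck on the entry below, which has a \n line break in the middle of the title.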

structure(list(value = "               What wrist should you wear your actigraphy device on? Analysis of dominant vs.\n            non-dominant wrist actigraphy for measuring sleep in healthy adults. \n                     Sleep Science. \n                        10:132-135.\n             2017\n\n                 Full text if available"), row.names = c(NA, 
-1L), class = c("tbl_df", "tbl", "data.frame"))

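To get around this, rather than splitting at every \n line break, I want to split the string only at the position between a \n line break and an uppercase letter, so that the title is not split into two separate columns.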

My original code, which splits at the \n line break, is simply:

str_split_fixed(x,"\n", 2)[ ,2]

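I have tried several combinations of regex lookahead/lookbehind, but I cannot work out how to split between two characters while keeping those characters on either side.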

Answer 1

You can use:

strsplit(df$value, '\n\\s+(?=[A-Z])', perl = TRUE)

#[[1]]
#[1] "               What wrist should you wear your actigraphy device on? Analysis of dominant vs.\n            non-dominant wrist actigraphy for measuring sleep in healthy adults. "
#[2] "Sleep Science. \n                        10:132-135.\n             2017"                                                                                                         
#[3] "Full text if available"

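This splits the text at a newline that is followed by one or more whitespace characters and then an uppercase letter. The uppercase letter is matched with a positive lookahead, so it is not consumed by the split and stays in the resulting string. Because \s also matches newlines, the blank line before "Full text if available" is swallowed as part of a single separator.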
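As a follow-up, here is a minimal base R sketch of how the split pieces might be cleaned into tidy values. The variable names (`value`, `pieces`, `clean`, `title`, `rest`) are made up for illustration, and the sample string is the entry from the question with its indentation shortened. It assumes a title's continuation line never starts with an uppercase letter; if one does, the title would still be split there.

```r
# Sample entry from the question (indentation shortened for readability).
value <- paste0(
  "   What wrist should you wear your actigraphy device on? Analysis of dominant vs.\n",
  "      non-dominant wrist actigraphy for measuring sleep in healthy adults. \n",
  "         Sleep Science. \n",
  "            10:132-135.\n",
  "      2017\n\n",
  "         Full text if available"
)

# Split at a newline plus whitespace, but only when an uppercase letter follows.
# The lookahead (?=[A-Z]) keeps that letter in the next piece.
pieces <- strsplit(value, "\\n\\s+(?=[A-Z])", perl = TRUE)[[1]]

# Collapse leftover whitespace runs (including inner newlines) to a single
# space, then trim both ends.
clean <- trimws(gsub("\\s+", " ", pieces, perl = TRUE))

title <- clean[1]
rest  <- clean[2]  # journal, pages and year, still in one string
```

Here `rest` still holds the journal, page range and year together; splitting it further would need a separate rule, for example the plain "\n" split from the question applied to `pieces[2]`.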


r

strsplit