I want to find short strings in new lines in an R character vector

Question

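I read a text of a few hundred words into R (using read_file on a .txt file). Some lines of the text contain only a very short fragment before the \n (for example 'Figure 1'). I want to replace those with a bare \n. So, in the example below, I want to gsub away the last 3 lines. I think they will all be under ~10 words, and none will contain a period (.) except possibly at the end. All of them begin and end with \n.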

Some are long. They might have short segments (like the preceding sentence), but they'll all be over some length, and will almost certainly have at least 2 sentence closings (abnormally long sentences aside). Others are short, like these:

Figure 1: description
  Materials and Methods
Introduction.

I tried:

gsub("\n(.{90,}[\\.\\?\\:].*){2,}\n$", "\n", string1, perl = TRUE)

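The idea behind the regex: after a newline, we want some characters (at least 50) before a punctuation mark (.?:), and we want that pattern to repeat at least twice before the next newline. I would like to add the (?gmi) modifiers (at least, the pattern works with them in regex101), but I can't find how to add them in R. I think the code above would work with the modifiers. Other options could also be interesting, for example a gsub on any \n (text) \n span that is under 90 characters and contains only one of ':.?', or something similar.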

Update: I think I could use something like str_replace_all(test, regex("^\n(.{50,}[\\.\\?\\:].*){2,}\n$", multiline = TRUE), "\n"), with stri_opts_regex from stringi to add options, but it is not clear to me how (or whether it would work).
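On the modifiers point, here is a minimal sketch in base R, assuming PCRE via perl = TRUE. Inline flags such as (?m) and (?i) go at the start of the pattern, and there is no g flag because gsub() already replaces every match. The sample text and the 74-character cutoff are made up for illustration; this blanks short lines rather than reproducing the pattern above.

```r
# Hypothetical sample text: one long line followed by three short ones.
txt <- paste(
  "Some are long. They might have short segments, but they will all be over some length. And they close sentences.",
  "Figure 1: description",
  "  Materials and Methods",
  "Introduction.",
  sep = "\n"
)

# (?m) makes ^ and $ match at the start and end of every line (PCRE syntax,
# hence perl = TRUE). "." does not match "\n", so each match stays on one line.
# The 74-character cutoff is an assumption; tune it for the real text.
cleaned <- gsub("(?m)^.{0,74}$", "", txt, perl = TRUE)

# Short lines are now empty, while their "\n" separators are left in place.
cat(cleaned)
```

stringr exposes the same switch as regex(pattern, multiline = TRUE), so the inline (?m) is only needed with base R functions.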

Answer 1

Thanks to Carlos's comment, I gave up on the regex and just used strsplit, like this:

library(stringr)  # for str_count()

# split the text into lines
holding <- unlist(strsplit(y, "\n"))
# count sentence-closing marks (. : ?) in each line
n_marks <- str_count(holding, "[.:?]")
# drop lines under 75 characters, and lines under 150 characters with fewer than 3 marks
keep <- nchar(holding) >= 150 | (nchar(holding) >= 75 & n_marks >= 3)
# recombine the kept lines back into y
y <- paste(holding[keep], collapse = "\n")

Not very elegant, but it doesn't need a complicated regex.
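For reuse, the filtering above could be wrapped in a small function. This is only a sketch: the name drop_short_lines and its argument names are hypothetical, the defaults are the 75/150/3 values from the answer, and it uses base R only so that stringr is not required.

```r
# Hypothetical helper; the name and argument names are illustrative.
drop_short_lines <- function(text, min_chars = 75, long_chars = 150, min_marks = 3) {
  lines <- unlist(strsplit(text, "\n", fixed = TRUE))
  # Count '.', ':' and '?' per line with base R instead of str_count().
  marks <- lengths(regmatches(lines, gregexpr("[.:?]", lines)))
  # Keep long lines, plus medium-length lines with enough sentence closings.
  keep <- nchar(lines) >= long_chars |
    (nchar(lines) >= min_chars & marks >= min_marks)
  paste(lines[keep], collapse = "\n")
}

# The example text from the question: one long line and three short ones.
sample_text <- paste(
  "Some are long. They might have short segments (like the preceding sentence), but they'll all be over some length, and will almost certainly have at least 2 sentence closings (abnormally long sentences aside). Others are short, like these:",
  "Figure 1: description",
  "  Materials and Methods",
  "Introduction.",
  sep = "\n"
)

result <- drop_short_lines(sample_text)
cat(result)
```

Unlike a replacement with "\n", this removes the short lines entirely, so no blank lines are left behind.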

regex

r

gsub