How to subset text from a Word docx AFTER a matching phrase

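I want to subset the text of an original Word docx ("original.docx") into a new Word docx ("desired.docx"), keeping only the text from the matching phrase "Drop Text Before Here" onward, while preserving the original formatting of the retained text.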

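I adapted the example from the body_remove() documentation in the {officer} package to show the original and the desired result (both as docx files). The difference is that the documentation example keeps the text before the match, whereas I want to keep the text from the matching phrase onward.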

library(officer)

# Original text
str1 <- rep("Lorem ipsum dolor sit amet, consectetur adipiscing elit. ", 3)
str1 <- paste(str1, collapse = "")

str2 <- "Drop Text Before Here"

str3 <- rep("Aenean venenatis varius elit et fermentum vivamus vehicula. ", 3)
str3 <- paste(str3, collapse = "")

# Create original_docx prior to subset
original_docx <- read_docx()
original_docx <- body_add_par(original_docx, value = str1, style = "Normal")
original_docx <- body_add_par(original_docx, value = str2, style = "centered")
original_docx <- body_add_par(original_docx, value = str3, style = "Normal")

# Save original docx in local directory
print(original_docx, "original.docx")

# Desired docx after subset starting at "Drop Text Before Here"
desired_docx <- read_docx()
desired_docx <- body_add_par(desired_docx, value = str2, style = "centered")
desired_docx <- body_add_par(desired_docx, value = str3, style = "Normal")

# Save desired docx in local directory
print(desired_docx, "desired.docx")

Created on 2022-04-09 by the reprex package (v2.0.1)

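You can use a custom function that walks backward through the document from the current cursor position, removing one body element at each step and stopping at the error that signals the beginning of the document.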

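# Recursively remove every body element that precedes the cursor.
body_remove_before_cursor <- function(x) {
  tryCatch(
    {
      # Step back one element, remove it, then repeat.
      x <- officer::cursor_backward(x)
      x <- officer::body_remove(x)
      body_remove_before_cursor(x)
    },
    # Stepping back past the first element raises an error: stop here.
    error = function(e) { 
      return(x)
    }
  )
}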

desired_2_docx <- read_docx('original.docx')
desired_2_docx <- cursor_reach(desired_2_docx, str2)
desired_2_docx <- body_remove_before_cursor(desired_2_docx)
print(desired_2_docx, 'desired_2.docx')
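
An alternative that does not depend on catching an error, offered as a minimal sketch rather than a tested solution: look up the position of the matching paragraph with docx_summary(), then remove the first body element once for every element that precedes it. The helper name keep_from_phrase() is hypothetical (it is not part of {officer}). The sketch assumes that doc_index from docx_summary() lines up with the elements the cursor steps over, which may not hold if something other than paragraphs and tables sits before the match.

```r
library(officer)

# Hypothetical helper (not part of officer): keep everything from the first
# paragraph containing `phrase` onward and write the result to `target`.
keep_from_phrase <- function(path, phrase, target) {
  doc <- read_docx(path)
  content <- docx_summary(doc)

  # Position of the first block whose text contains the phrase.
  hit <- content$doc_index[grepl(phrase, content$text, fixed = TRUE)][1]
  if (is.na(hit)) stop("phrase not found: ", phrase)

  # Remove the first block once for every block that precedes the match.
  # Assumption: doc_index counts the same blocks the cursor steps over.
  for (i in seq_len(hit - 1)) {
    doc <- cursor_begin(doc)
    doc <- body_remove(doc)
  }

  print(doc, target = target)
}
```

With the file from the question, the call would be keep_from_phrase("original.docx", "Drop Text Before Here", "desired_3.docx"). Resetting the cursor with cursor_begin() before each removal avoids relying on where body_remove() leaves the cursor.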