R - subscript out of bounds with stringr and grep in lapply function

Question

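My goal is to extract fixed-length (8-digit) number strings that follow a specific pattern from multiple text files spread across multiple folders.

I spent a whole day trying to build an lapply function so that all files in a subdirectory (up to 20) are processed automatically, but I failed. The workflow code runs, but with my limited knowledge of R it only works for a single file.

Between the lines of numbers, each file contains one string, which differs from file to file, and that is what I want to extract. The extracted strings should be stored per folder.

The string has the following structure: String[one or two digits]_[eight digits], for example String1_20220101 or String12_20220108. I want to extract the part after the underscore.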
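The structure described above can be sketched in base R as follows. This is only an illustration under the assumption that the identifier lines look exactly like the two examples; the vector `lines` and the variable names are made up for the sketch and are not part of the original workflow.

```r
# Hypothetical sample lines: two numeric rows and two identifier rows
lines <- c("1000 100", "String1_20220101", "1050 100", "String12_20220108")

# "String", one or two digits, an underscore, then exactly eight digits
id_pattern <- "String[0-9]{1,2}_[0-9]{8}"

# regexpr() locates the first match per line; regmatches() returns
# the matched text only for the lines that actually matched
ids <- regmatches(lines, regexpr(id_pattern, lines))

# Drop everything up to and including the underscore
dates <- sub(".*_", "", ids)
print(dates)
```

The result stays a character vector, so a leading zero in the eight digits would not be lost; wrap it in as.numeric() only if a number is really needed.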

The text files are structured as follows, and each has more than 10,000 lines.

Example of file 1:

     X1  X2
1 1000 100
2 1050 100
3 1100 100
4 1150 100
5 1200 100
6 String1_20220101
7 1250 100
8 1300 100
9 1350 100
10 1400 100

x1 <- list(c(seq(1000,1400, by=50)))
[1] 1000 1050 1100 1150 1200 1250 1300 1350 1400

x2 <- list(c(rep(100, 9)))
[1] 100 100 100 100 100 100 100 100 100

File 2:

   x1     x2
1 2000  200
2 3000  200
3 4000  200
4 5000  200
5 6000  200
6 7000  200
7 String12_20220108
8 8000  200
9 9000  200
10 10000 200


x1 <- list(c(seq(2000,10000,by=1000)))
[1]  2000  3000  4000  5000  6000  7000  8000  9000 10000

x2 <- list(c(rep(200, 9)))
[1] 200 200 200 200 200 200 200 200 200

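The files are located in numbered folders; their names are derived from the folder number, and they belong to one observation.

My code for folder 1: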

library(stringr)

Folderno1 <- list.files(path = "path/to/file/1/",
pattern = "*.txt",
full.names = TRUE)

FUN <- function(Folder1) {
folder_input <- readLines(Folderno1)
string <- grep("String[0-9]_", folder_input, value = TRUE)
output <- capture.output(as.numeric(str_extract_all(string, "(?<=[0-9]{1,2}_)[0-9]+")[[1]]))
write(output, file="/pathtofile/String1.tex")
}

lapply(Folderno1, FUN)

Error in str_extract_all(string, "(?<=[0-9]{1,2}_)[0-9]+")[[1]] : 
subscript out of bounds

The error message above appears. Despite the error, the file String1.tex is written, but it contains only one result:

[1] 20220101

Re-running with debug shows:

function (x) 
.Internal(withVisible(x))

Could you guide me on how to change the workflow so that every file is processed? I can't wrap my head around it.

Thank you.

Answer 1

You are overwriting the same file on every call of the function (write(output, file="/pathtofile/String1.tex")). You probably want to create a new .tex file for each .txt file.
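One way to avoid the overwrite is to derive the output name from each input file. The sketch below assumes a hypothetical output directory "out" and made-up input paths; the code further down in this answer names the files after the matched identifier instead, which works just as well as long as identifiers are unique.

```r
# Hypothetical input paths, as list.files(..., full.names = TRUE) would return them
input_files <- c("path/to/file/1/a.txt", "path/to/file/1/b.txt")

# Strip the directory and the extension to get a per-file stem
stem <- tools::file_path_sans_ext(basename(input_files))

# Build one distinct .tex path per input file ("out" is an assumed directory)
output_files <- file.path("out", paste0(stem, ".tex"))
print(output_files)
```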

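Judging by the error message, I think some files do not contain the pattern we are looking for (String[0-9]_). String[0-9]_ does not match two-digit identifiers such as String12_20220108, so grep returns an empty vector, str_extract_all returns an empty list, and [[1]] fails with "subscript out of bounds". I have changed the pattern to String[0-9]+_. To be safe, I have also added an if condition that checks whether a matching line was found.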
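The failure can be reproduced in a few lines of base R. This is a sketch, not the asker's exact session: an empty list stands in for what str_extract_all() is expected to return on zero-length input, so stringr is not needed to run it.

```r
line <- "String12_20220108"

# The original pattern allows exactly one digit before the underscore
one_digit <- grep("String[0-9]_", line, value = TRUE)

# The corrected pattern allows one or more digits
any_digits <- grep("String[0-9]+_", line, value = TRUE)

print(length(one_digit))   # no match: grep() returns character(0)
print(length(any_digits))  # the line is found

# Taking the first element of an empty list raises an error; the empty
# list here is a stand-in for the result of extracting from character(0)
msg <- tryCatch(list()[[1]], error = function(e) conditionMessage(e))
print(msg)
```

Checking length() before indexing, as the if condition in the solution does, sidesteps the error for files without a matching line.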

Try this solution:

# list.files() expects a regular expression, not a glob, so match a literal ".txt" suffix
Folderno1 <- list.files(path = "path/to/file/1/",
                        pattern = "\\.txt$",
                        full.names = TRUE)

FUN <- function(Folder1) {
  # Read one file (Folder1 is the path of a single .txt file)
  folder_input <- readLines(Folder1)
  # Extract the line that contains "String", one or more digits and an underscore
  string <- grep("String[0-9]+_", folder_input, value = TRUE)
  # Only continue if such a line exists
  if(length(string)) {
    # Remove everything up to the underscore to get the 8-digit number
    output <- sub('.*_', '', string)
    # Remove everything after the underscore to get "String1", "String12"
    out <- sub('_.*', '', string)
    # Write the output to a file named after the identifier
    write(output, file= paste0('/pathtofile/', out, '.tex'))
  }
}

lapply(Folderno1, FUN)
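The question also asks for results stored per folder. A possible extension, sketched under assumptions: the numbered folders sit directly under one root directory, and one result file per folder is wanted. The temporary directories, the file names a.txt and b.txt, the folder_<n>.txt naming and the helper extract_ids() are all hypothetical and only serve to make the example runnable.

```r
# Hypothetical layout built in a temp dir: <root>/1/a.txt and <root>/2/b.txt
root <- file.path(tempdir(), "observations")
out_dir <- file.path(tempdir(), "extracted")
dir.create(file.path(root, "1"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "2"), recursive = TRUE, showWarnings = FALSE)
dir.create(out_dir, showWarnings = FALSE)
writeLines(c("1000 100", "String1_20220101", "1050 100"),
           file.path(root, "1", "a.txt"))
writeLines(c("2000 200", "String12_20220108", "3000 200"),
           file.path(root, "2", "b.txt"))

# Return the eight-digit parts found in one file (character(0) if none)
extract_ids <- function(path) {
  lines <- readLines(path)
  hits <- grep("String[0-9]+_[0-9]{8}", lines, value = TRUE)
  sub(".*_", "", hits)
}

# Walk every numbered folder and write one result file per folder
folders <- list.dirs(root, recursive = FALSE)
results <- lapply(folders, function(folder) {
  files <- list.files(folder, pattern = "\\.txt$", full.names = TRUE)
  ids <- unlist(lapply(files, extract_ids))
  if (length(ids) > 0) {
    writeLines(ids, file.path(out_dir, paste0("folder_", basename(folder), ".txt")))
  }
  ids
})
names(results) <- basename(folders)
print(results)
```

Returning the extracted values from the inner function, instead of only writing them, keeps the results available in the session as a list named by folder.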
