How to speed up a while-loop in R (perhaps using dopar)?

Question

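I am trying to process a huge text file containing tens of millions of lines. The file holds the results of a convolutional network analysis of several million images and looks like this: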

 CUDNN_HALF=1 
net.optimized_memory = 0 
mini_batch = 1, batch = 8, time_steps = 1, train = 0 
nms_kind: greedynms (1), beta = 0.600000 
nms_kind: greedynms (1), beta = 0.600000 
nms_kind: greedynms (1), beta = 0.600000 

 seen 64, trained: 447 K-images (6 Kilo-batches_64) 
Enter Image Path: data/obj1/H001683-19-1-5-OCT2 [x=13390,y=52118,w=256,h=256].png: Predicted in 19.894000 milli-seconds.
tumor: 99%  (left_x:    2   top_y:  160   width:   67   height:   34)
bcell: 98%  (left_x:    6   top_y:   54   width:   32   height:   22)
bcell: 80%  (left_x:   51   top_y:    0   width:   30   height:   16)
bcell: 98%  (left_x:   52   top_y:  198   width:   28   height:   26)
bcell: 98%  (left_x:  150   top_y:  216   width:   35   height:   23)
bcell: 56%  (left_x:  150   top_y:   78   width:   45   height:   30)
bcell: 91%  (left_x:  187   top_y:  132   width:   31   height:   26)
bcell: 96%  (left_x:  219   top_y:  185   width:   20   height:   26)
bcell: 37%  (left_x:  222   top_y:   -0   width:   24   height:    4)
bcell: 98%  (left_x:  241   top_y:  208   width:   15   height:   21)
bcell: 64%  (left_x:  248   top_y:   35   width:    8   height:   35)
 [... a lot of similar lines...] 
Enter Image Path: data/obj1/H001683-19-1-5-OCT2 [x=13390,y=52530,w=256,h=256].png: Predicted in 19.195000 milli-seconds.
bcell: 97%  (left_x:   45   top_y:  180   width:   29   height:   24)
bcell: 58%  (left_x:   59   top_y:    1   width:   35   height:   22)
tumor: 98%  (left_x:  105   top_y:  143   width:   99   height:   44)
tumor: 97%  (left_x:  113   top_y:   50   width:   57   height:   40)
bcell: 96%  (left_x:  191   top_y:  194   width:   29   height:   27)
bcell: 99%  (left_x:  201   top_y:  129   width:   34   height:   22)
Enter Image Path:

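Each image is introduced by a line that starts with "Enter Image Path:" and gives the image file name, followed by the list of detected objects. I do not know in advance how many objects (here tumor and bcell) each image contains. Sometimes there are none at all, sometimes there are hundreds. I first tried reading the whole file at once with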

test11<-readLines("result.txt")
picsna<-grep(test11,pattern="Enter Image") # line numbers with the image file name
lle<-length(picsna) # length for the subsequent script

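and then continuing with the rest of my script, but reading the file turned out to take hours. So I came up with the idea of reading the file line by line in a while-loop: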

library(LaF)
n <- 1
lle <- 0 # number of images (to be used in subsequent code)
picsna <- c() # vector with the line numbers of each image entry

# read the result file initially (the first few lines do not contain image entries)
test11 <- get_lines("result.txt", line_numbers = n)
# as long as the line exists, read the next line and do the following:
while (!is.na(test11)) {
  test11 <- get_lines("result.txt", line_numbers = n + 1)
  # I wanted to know how far the reading had progressed, but had the feeling that print slowed down the loop
  # print(n)
  # I found this solution for printing progress periodically
  if (n %% 10000 == 0) {
    cat(paste0("iteration: ", n, "\n"))
  }
  # look for an image entry and save the line number (not the iteration number)
  if (grepl("Enter Image", test11, fixed = TRUE)) {
    picsna <- c(picsna, n + 1)
    lle <- lle + 1 # increase the number of images
  }
  n <- n + 1
}
# the last line of the file is always incomplete, but it has to be added to the vector
# so that the number of objects of the previous image (if it had any) can be calculated
# in a following script not shown here
if (is.na(test11)) {
  picsna <- c(picsna, n)
  print("The End")
  lle <- lle + 1
}
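To illustrate why the position of the final entry matters: if `picsna` ends with the line number of the trailing "Enter Image Path:" line, the number of detections per image is simply the gap between consecutive entries minus one. The following is a minimal sketch on a made-up in-memory vector (`toy` is hypothetical data, not taken from the real file), assuming that only detection lines sit between two image entries.

```r
# Hypothetical miniature of the result file, already split into lines
toy <- c(
  "net.optimized_memory = 0",
  "Enter Image Path: a.png: Predicted in 19.9 milli-seconds.",
  "tumor: 99%  (left_x:    2   top_y:  160   width:   67   height:   34)",
  "bcell: 98%  (left_x:    6   top_y:   54   width:   32   height:   22)",
  "Enter Image Path: b.png: Predicted in 19.2 milli-seconds.",
  "Enter Image Path: c.png: Predicted in 19.5 milli-seconds.",
  "bcell: 97%  (left_x:   45   top_y:  180   width:   29   height:   24)",
  "Enter Image Path:"
)

# Line numbers of all image entries, including the trailing incomplete one
picsna <- grep("Enter Image", toy, fixed = TRUE)

# Lines between two consecutive entries are the detections of the first one
objects_per_image <- diff(picsna) - 1L
print(objects_per_image)
```

Without the final entry, the last image would have no closing boundary and its detections could not be counted this way.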

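I measured the run time of both scripts on a small result file of about 200 lines. The second script was actually a bit slower (0.04 vs. 0.01), which puzzled me. I considered rewriting it as a foreach-%dopar% loop, but could not work out how to do that with the readLines function or with my while-loop. My problem is that I do not know how many lines the file contains. I would be very grateful if someone could help me parallelize my script!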
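A possible middle ground between the two scripts, which does not need the line count in advance, is to keep one connection open and read it in chunks. The sketch below uses only base R. `find_image_lines` and `chunk_size` are hypothetical names chosen for illustration, and the demo file is made up. The likely reason the line-by-line version is not faster is that, as far as I can tell, every `get_lines` call has to scan the file again to reach the requested line, and growing `picsna` with `c()` copies the vector on each hit. Treat that as an assumption, not a measured result.

```r
# Return the line numbers of all lines containing `pattern`,
# reading the file in chunks from a single open connection.
find_image_lines <- function(path, pattern = "Enter Image", chunk_size = 100000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  hits <- list()
  offset <- 0 # number of lines consumed so far
  repeat {
    chunk <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(chunk) == 0L) break # end of file reached
    # grep() returns positions within the chunk; shift them to file positions
    hits[[length(hits) + 1L]] <- offset + grep(pattern, chunk, fixed = TRUE)
    offset <- offset + length(chunk)
  }
  as.integer(unlist(hits))
}

# Demo on a small made-up file whose last line is incomplete (no image name)
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "net.optimized_memory = 0",
  "Enter Image Path: a.png: Predicted in 19.9 milli-seconds.",
  "tumor: 99%  (left_x:    2   top_y:  160   width:   67   height:   34)",
  "Enter Image Path: b.png: Predicted in 19.2 milli-seconds.",
  "bcell: 97%  (left_x:   45   top_y:  180   width:   29   height:   24)",
  "Enter Image Path:"
), tmp)

picsna <- find_image_lines(tmp, chunk_size = 2L)
lle <- length(picsna)
print(picsna)
```

The chunk size only trades memory for the number of `readLines` calls. Since the work is dominated by reading the file sequentially, it is not obvious that %dopar% would help here.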

Answer 1

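Thank you, @Bas! I tested your suggestion on a Linux machine: for a file with about 239 million lines it took less than 1 minute. By appending >lines.txt I could save the result. Interestingly, my first readLines R script took "only" 29 minutes there, which is surprisingly fast compared with my first experience (so my Windows work PC probably has some problem that has nothing to do with R).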
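The exact command suggested by @Bas is not quoted in this thread. Assuming it was a command-line search along the lines of `grep -n "Enter Image" result.txt >lines.txt`, each line of lines.txt would have the form `line_number:matched text`. Under that assumption, a sketch for loading the line numbers back into R could look like this (`parse_grep_n` is a hypothetical helper, and the demo file content is invented):

```r
# Extract the leading line numbers from `grep -n` style output ("123:text")
parse_grep_n <- function(path) {
  out <- readLines(path, warn = FALSE)
  # drop everything from the first colon onwards, keeping only the number
  as.integer(sub(":.*$", "", out))
}

# Demo with a made-up lines.txt
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "9:Enter Image Path: a.png: Predicted in 19.894000 milli-seconds.",
  "22:Enter Image Path: b.png: Predicted in 19.195000 milli-seconds.",
  "29:Enter Image Path:"
), tmp)

picsna <- parse_grep_n(tmp)
lle <- length(picsna)
print(picsna)
```

If the command used produced a different output format, the `sub()` pattern would need to be adapted accordingly.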
