How to speed up a while-loop in R (perhaps using dopar)?
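I am trying to process a huge text file containing tens of millions of lines. The file contains the results of a convolutional network analysis of several million images and looks like this: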
CUDNN_HALF=1
net.optimized_memory = 0
mini_batch = 1, batch = 8, time_steps = 1, train = 0
nms_kind: greedynms (1), beta = 0.600000
nms_kind: greedynms (1), beta = 0.600000
nms_kind: greedynms (1), beta = 0.600000
seen 64, trained: 447 K-images (6 Kilo-batches_64)
Enter Image Path: data/obj1/H001683-19-1-5-OCT2 [x=13390,y=52118,w=256,h=256].png: Predicted in 19.894000 milli-seconds.
tumor: 99% (left_x: 2 top_y: 160 width: 67 height: 34)
bcell: 98% (left_x: 6 top_y: 54 width: 32 height: 22)
bcell: 80% (left_x: 51 top_y: 0 width: 30 height: 16)
bcell: 98% (left_x: 52 top_y: 198 width: 28 height: 26)
bcell: 98% (left_x: 150 top_y: 216 width: 35 height: 23)
bcell: 56% (left_x: 150 top_y: 78 width: 45 height: 30)
bcell: 91% (left_x: 187 top_y: 132 width: 31 height: 26)
bcell: 96% (left_x: 219 top_y: 185 width: 20 height: 26)
bcell: 37% (left_x: 222 top_y: -0 width: 24 height: 4)
bcell: 98% (left_x: 241 top_y: 208 width: 15 height: 21)
bcell: 64% (left_x: 248 top_y: 35 width: 8 height: 35)
[... a lot of similar lines...]
Enter Image Path: data/obj1/H001683-19-1-5-OCT2 [x=13390,y=52530,w=256,h=256].png: Predicted in 19.195000 milli-seconds.
bcell: 97% (left_x: 45 top_y: 180 width: 29 height: 24)
bcell: 58% (left_x: 59 top_y: 1 width: 35 height: 22)
tumor: 98% (left_x: 105 top_y: 143 width: 99 height: 44)
tumor: 97% (left_x: 113 top_y: 50 width: 57 height: 40)
bcell: 96% (left_x: 191 top_y: 194 width: 29 height: 27)
bcell: 99% (left_x: 201 top_y: 129 width: 34 height: 22)
Enter Image Path:
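Each image is listed by its file name after "Enter Image Path:", followed by the list of recognized objects. I don't know how many objects each image contains (here tumor and bcell). Sometimes there are no objects at all, sometimes there are hundreds.
I first tried reading the entire file with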
test11<-readLines("result.txt")
picsna<-grep(test11,pattern="Enter Image") # line numbers with the image file name
lle<-length(picsna) # length for the subsequent script
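and then continuing with my script, but reading the file turned out to take hours, so I came up with the idea of reading the file line by line in a while loop: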
require(LaF)
n=1
lle<-0 # number of images (to be used in a subsequent code)
picsna<-c() # vector with the line numbers of each image entry
# read the result file initially (the first few lines do not contain image entries)
test11<-get_lines(file="result.txt", line_numbers=n)
# as long as the line exists, read the next line and do the following:
while(is.na(test11)==FALSE){
  test11<-get_lines(file="result.txt", line_numbers=n+1)
  # I wanted to know how far the reading had progressed, but had a feeling that print slowed down the loop
  #print(n)
  # I found this solution for printing progress periodically
  if(n %% 10000==0) {
    cat(paste0("iteration: ", n, "\n"))
  }
  # look for an image entry and save the line number (not the iteration number)
  if(grepl(test11,pattern="Enter Image")==TRUE){
    picsna<-c(picsna,n+1)
    lle<-lle+1 # increase the number of images
  }
  n<-n+1
}
# the last line of the file is always incomplete but has to be added to the vector to calculate the number of objects (in a following script not shown here) if the previous image had any.
if(is.na(test11)==TRUE){
  picsna<-c(picsna,n)
  print("The End")
  lle<-lle+1
}
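I measured the run time of the first and the second script on a small result file of about 200 lines. The second script was even slightly slower (0.04 vs. 0.01), which puzzles me.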
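I considered rewriting it as a foreach %dopar% loop, but could not figure out how to implement that with the readLines function or with my while loop. My problem is that I don't know how many lines the file contains. I would be very grateful if someone could help me parallelize my script!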
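One possible direction that does not require knowing the number of lines in advance is to read the file in chunks through an open connection instead of one line per call. The following is only a sketch, not a parallel solution: find_pattern_lines() and chunk_size are hypothetical names introduced for this example, and the demo lines and file names are made up.

```r
# Sketch: scan a text file chunk by chunk and collect the line numbers
# of all lines that contain a fixed string.
# find_pattern_lines() and chunk_size are hypothetical names for this example.
find_pattern_lines <- function(path, pattern = "Enter Image", chunk_size = 1e6) {
  con <- file(path, open = "r")
  on.exit(close(con))
  hits <- list()   # one vector of line numbers per chunk, combined once at the end
  offset <- 0      # number of lines read before the current chunk
  repeat {
    chunk <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(chunk) == 0) break   # nothing left to read
    # fixed = TRUE treats the pattern as a plain string, not a regular expression
    hits[[length(hits) + 1]] <- offset + grep(pattern, chunk, fixed = TRUE)
    offset <- offset + length(chunk)
  }
  list(picsna = unlist(hits), n_lines = offset)
}

# Small demonstration on a temporary file (made-up content)
tmp <- tempfile(fileext = ".txt")
writeLines(c("CUDNN_HALF=1",
             "Enter Image Path: a.png: Predicted in 19.9 milli-seconds.",
             "bcell: 98% (left_x: 6 top_y: 54 width: 32 height: 22)",
             "Enter Image Path: b.png: Predicted in 19.2 milli-seconds.",
             "Enter Image Path:"), tmp)
res <- find_pattern_lines(tmp, chunk_size = 2)
picsna <- res$picsna      # line numbers of the image entries
lle <- length(picsna)     # number of image entries
```

Because the connection stays open, each readLines call continues where the previous one stopped, and the result vector is assembled once at the end rather than grown with c() on every match.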
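Thank you, @Bas! I tested your suggestion on a Linux machine: for a file with about 239 million lines it took less than 1 minute. By adding >lines.txt I could save the results. Interestingly, my first readLines R script needed "only" 29 minutes, which is surprisingly fast compared with my first experience (so my Windows work computer may have some problem that is unrelated to R).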
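To continue in R from the saved file, the line numbers have to be read back in. The suggested command is not shown here, so the format of lines.txt is an assumption: the sketch below supposes that each line starts with the line number followed by a colon (as `grep -n` would write it). read_line_numbers() is a hypothetical helper name and the demo content is made up.

```r
# Assumption: every line of the saved file looks like "<line number>:<matched text>".
# Adjust the parsing if the actual format differs.
read_line_numbers <- function(path) {
  x <- readLines(path, warn = FALSE)
  # drop everything from the first colon onward and convert the rest to a number
  as.numeric(sub(":.*$", "", x))
}

# Small demonstration on a temporary file (made-up content)
tmp_lines <- tempfile(fileext = ".txt")
writeLines(c("8:Enter Image Path: a.png: Predicted in 19.9 milli-seconds.",
             "21:Enter Image Path: b.png: Predicted in 19.2 milli-seconds.",
             "28:Enter Image Path:"), tmp_lines)
picsna_saved <- read_line_numbers(tmp_lines)
lle_saved <- length(picsna_saved)
```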