Setting the seed with parallel cv.glmnet gives different results in R
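I am running cv.glmnet from the glmnet package in parallel over more than 1000 data sets. In each run I set the seed so that the results are reproducible, yet I noticed that my results differ. The odd part is that when I run the code again on the same day, the results are identical, but on the next day they are different.
Here is my code: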
model <- function(path, file, wyniki, faktor = 0.75) {
  set.seed(2)
  dane <- read.csv(file)
  n <- nrow(dane)
  podzial <- 1:floor(faktor * n)
  ########## GLMNET ############
  nFolds <- 3
  train_sparse <- dane[podzial, ]
  test_sparse <- dane[-podzial, ]
  # fit with cross-validation
  tryCatch({
    wart <- c(rep(0, 6), "nie")
    model <- cv.glmnet(train_sparse[, -1], train_sparse[, 1], nfolds = nFolds, standardize = FALSE)
    pred <- predict(model, test_sparse[, -1], type = "response", s = model$lambda.min)
    # fetch the AUC value
    aucp1 <- roc(test_sparse[, 1], pred)$auc
  }, error = function(e) print("error"))
  results <- data.frame(auc = aucp1, n = nrow(dane))
  write.table(results, wyniki, sep = ',', append = TRUE, row.names = FALSE, col.names = FALSE)
}
path <- path_to_files
files <- list.files(path, full.names = TRUE, recursive = TRUE)
wyniki <- "wyniki_adex__samplingfalse_decl_201512.csv"
library('doSNOW')
library('parallel')
# number of threads
threads <- 5
# register the threads
cl <- makeCluster(threads, outfile = "")
registerDoSNOW(cl)
message("Loading packages on threads...")
clusterEvalQ(cl, library(pROC))
clusterEvalQ(cl, library(ROCR))
clusterEvalQ(cl, library(glmnet))
clusterEvalQ(cl, library(stringi))
message("Modelling...")
foreach(i = 1:length(files)) %dopar% {
  print(i)
  model(path, files[i], wyniki)
}
Does anyone know what the cause might be?
I am running CentOS Linux release 7.0.1406 (Core) / Red Hat 4.8.2-16.
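According to Writing R Extensions, a C wrapper is needed to call R's normal random number routines from FORTRAN. I don't see any C code in the glmnet source, so I'm afraid it looks like this is not implemented.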
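I found the answer in the documentation of the cv.glmnet function:
Note also that the results of cv.glmnet are random, since the folds are selected at random.
The solution is to set the folds manually, so that they are not picked at random inside cv.glmnet: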
nFolds <- 3
# assign each training row to one of nFolds folds, in shuffled order
foldid <- sample(rep(seq(nFolds), length.out = nrow(train_sparse)))
model <- cv.glmnet(x = as.matrix(train_sparse[, -1]),
                   y = train_sparse[, 1],
                   nfolds = nFolds,
                   foldid = foldid,
                   standardize = FALSE)
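Note that sample() itself draws from the RNG, so the fold assignment is only reproducible if the seed is fixed right before it. As a minimal sketch of that idea (the helper name make_foldid and the simulated x and y are assumptions for illustration, not part of the original code), the fold vector can be built once from a known seed and reused:

```r
# Hypothetical helper (not from the original post): build a balanced fold
# assignment that depends only on n, n_folds and seed.
# Note: set.seed() here resets the global RNG state of the calling process.
make_foldid <- function(n, n_folds, seed) {
  set.seed(seed)
  sample(rep(seq_len(n_folds), length.out = n))
}

foldid_a <- make_foldid(n = 100, n_folds = 3, seed = 2)
foldid_b <- make_foldid(n = 100, n_folds = 3, seed = 2)
print(identical(foldid_a, foldid_b))  # same seed, so the same folds

# Optional check with glmnet, run only if the package is installed.
# x and y are simulated data, used purely for illustration.
if (requireNamespace("glmnet", quietly = TRUE)) {
  set.seed(1)
  x <- matrix(rnorm(100 * 5), nrow = 100)
  y <- rnorm(100)
  fit_a <- glmnet::cv.glmnet(x, y, foldid = foldid_a, standardize = FALSE)
  fit_b <- glmnet::cv.glmnet(x, y, foldid = foldid_b, standardize = FALSE)
  # With identical folds, both fits should select the same lambda.
  print(all.equal(fit_a$lambda.min, fit_b$lambda.min))
}
```

Inside the model function from the question, the same helper could be called with nrow(train_sparse) and the resulting vector passed as foldid, so that each worker uses folds that do not depend on its RNG state at the time of the call.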