Generator function in LSTM Keras for outputting mini-batches from one file
I have a generator function that works as such. I have a large list of .txt files, and each file is also quite long. The task now is to write a generator function that:
- takes a batch of files
- and then yields mini-batches of size 128 from a single file
My current code:
library(magrittr)    # %>%
library(readr)       # read_lines()
library(stringr)     # str_to_lower(), str_c()
library(tm)          # removeNumbers()
library(tokenizers)  # tokenize_characters()
library(purrr)       # map(), transpose()

data_files_generator <- function(train_set) {
  files <- train_set
  next_file <- 0

  function() {
    # move to the next file (note the <<- assignment operator)
    next_file <<- next_file + 1

    # if we've exhausted all of the files then start again at the
    # beginning of the list (keras generators need to yield
    # data infinitely -- termination is controlled by the epochs
    # and steps_per_epoch arguments to fit_generator())
    if (next_file > length(files))
      next_file <<- 1

    # determine the file name
    file <- files[[next_file]]

    text <- read_lines(paste(data_dir, file, sep = "")) %>%
      str_to_lower() %>%
      str_c(collapse = "\n") %>%
      removeNumbers() %>%
      tokenize_characters(strip_non_alphanum = FALSE, simplify = TRUE)

    text <- text[text %in% chars]

    dataset <- map(
      seq(1, length(text) - maxlen - 1, by = 3),
      ~ list(sentence = text[.x:(.x + maxlen - 1)], next_char = text[.x + maxlen])
    )
    dataset <- transpose(dataset)

    # vectorization: one-hot encode each character
    x <- array(0, dim = c(length(dataset$sentence), maxlen, length(chars)))
    y <- array(0, dim = c(length(dataset$sentence), length(chars)))

    for (i in 1:length(dataset$sentence)) {
      x[i, , ] <- sapply(chars, function(ch) {
        as.integer(ch == dataset$sentence[[i]])
      })
      y[i, ] <- as.integer(chars == dataset$next_char[[i]])
    }

    # truncate to a multiple of the mini-batch size
    rounded_dim <- floor(dim(x)[1] / mini_batch_size)
    match_size_to_batch <- 128 * rounded_dim

    x <- x[1:match_size_to_batch, 1:maxlen, 1:length(chars)]
    y <- y[1:match_size_to_batch, 1:length(chars)]

    return(list(x, y))
  }
}
So what comes in is one text file, which gets cut into smaller text snippets (of length maxlen) that are then one-hot encoded into matrices of 0s and 1s, as sketched below.
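To make the encoding concrete, here is a toy sketch of that one-hot step (the vocabulary and snippet here are made up for illustration; the real code uses the globals chars and maxlen):

chars   <- c("a", "b", "c")        # toy character vocabulary
maxlen  <- 4
snippet <- c("b", "a", "c", "a")   # one snippet of length maxlen

# rows = timesteps, columns = vocabulary entries
one_hot <- sapply(chars, function(ch) as.integer(ch == snippet))
one_hot
#      a b c
# [1,] 0 1 0
# [2,] 1 0 0
# [3,] 0 0 1
# [4,] 1 0 0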
The problem is that, with my code, the output is one data cube of size samples x maxlen x length(chars), where the number of samples is very large. That is why I want my generator function to always output one batch of size 128 x maxlen x length(chars), then the next batch of the same size, and so on until the whole text file has been consumed, and then move on to the next text file...
The current output is an error:
Error in py_call_impl(callable, dots$args, dots$keywords) :
ValueError: Cannot feed value of shape (112512, 40, 43) for Tensor 'lstm_layer_input_1:0', which has shape '(128, 40, 43)'
I hope I have explained it clearly enough. I think I have to insert some kind of for loop that iterates over the samples, but I have no idea how to include it in the generator function.
Based on the error, you are trying to feed an object of shape (112512, 40, 43), but your LSTM layer expects an object of shape (128, 40, 43). There seems to be some code missing, but did you fix the batch size when you defined the input layer? I have had luck defining my input layer as:
l_input = Input(shape = (None, num_features), name = 'input_layer')
I suspect the error comes from these lines of code:
rounded_dim <- floor(dim(x)[1]/mini_batch_size)
match_size_to_batch <- 128 * rounded_dim
which give you a batch size much larger than 128 (for example, with roughly 112.5k usable samples, floor(dim(x)[1] / 128) = 879 and 128 * 879 = 112512, which is exactly the first dimension in the error). According to the Keras documentation, the input shape should be (batch_size, timesteps, input_dim). The batch size does not need to be the same across an entire epoch, but within one batch all samples need the same number of timesteps (which you appear to handle with maxlen).
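The same idea in the R interface to keras looks roughly like this (a minimal sketch, not your actual model; the layer sizes are placeholders, and maxlen and chars are the globals from the question). The key point is that input_shape omits the batch dimension, so batches of any size are accepted:

library(keras)

# sketch only: 128 LSTM units is an arbitrary placeholder;
# input_shape = (timesteps, input_dim) leaves batch_size unspecified
model <- keras_model_sequential() %>%
  layer_lstm(units = 128, input_shape = c(maxlen, length(chars))) %>%
  layer_dense(units = length(chars), activation = "softmax") %>%
  compile(loss = "categorical_crossentropy", optimizer = "adam")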
I have implemented the missing slicing logic, and the generator now returns batches of size 128:
Changed code:
data_files_generator <- function(train_set) {
  files <- train_set
  next_file <- 0
  # Edit: closure state, so one file can be served out as several
  # consecutive 128-sample batches across successive calls
  batch_idx <- 0   # next 128-sample slice of the current file
  n_batches <- 0   # number of full slices in the current file
  x_file <- NULL   # cached one-hot arrays for the current file
  y_file <- NULL

  function() {
    # load and vectorize a new file only when the current one is exhausted
    if (batch_idx >= n_batches) {
      # move to the next file (note the <<- assignment operator)
      next_file <<- next_file + 1

      # if we've exhausted all of the files then start again at the
      # beginning of the list (keras generators need to yield
      # data infinitely -- termination is controlled by the epochs
      # and steps_per_epoch arguments to fit_generator())
      if (next_file > length(files))
        next_file <<- 1

      # determine the file name
      file <- files[[next_file]]

      text <- read_lines(paste(data_dir, file, sep = "")) %>%
        str_to_lower() %>%
        str_c(collapse = "\n") %>%
        removeNumbers() %>%
        tokenize_characters(strip_non_alphanum = FALSE, simplify = TRUE)

      text <- text[text %in% chars]

      dataset <- map(
        seq(1, length(text) - maxlen - 1, by = 3),
        ~ list(sentence = text[.x:(.x + maxlen - 1)], next_char = text[.x + maxlen])
      )
      dataset <- transpose(dataset)

      # vectorization: one-hot encode each character
      x <- array(0, dim = c(length(dataset$sentence), maxlen, length(chars)))
      y <- array(0, dim = c(length(dataset$sentence), length(chars)))

      for (i in 1:length(dataset$sentence)) {
        x[i, , ] <- sapply(chars, function(ch) {
          as.integer(ch == dataset$sentence[[i]])
        })
        y[i, ] <- as.integer(chars == dataset$next_char[[i]])
      }

      # truncate to a multiple of the mini-batch size and cache the file
      n_batches <<- floor(dim(x)[1] / mini_batch_size)
      match_size_to_batch <- 128 * n_batches
      x_file <<- x[1:match_size_to_batch, 1:maxlen, 1:length(chars)]
      y_file <<- y[1:match_size_to_batch, 1:length(chars)]
      batch_idx <<- 0
    }

    # Edit: hand out the next 128-sample slice of the cached file
    batch_idx <<- batch_idx + 1
    span_start <- (batch_idx - 1) * 128 + 1
    span_end <- batch_idx * 128

    return(list(x_file[span_start:span_end, , , drop = FALSE],
                y_file[span_start:span_end, , drop = FALSE]))
  }
}
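For completeness, a hedged usage sketch (model is assumed to be a compiled keras model and train_files the list of .txt file names; the step counts are placeholders). Since each call now yields exactly one 128-sample batch, steps_per_epoch decides how many batches make up an epoch:

gen <- data_files_generator(train_files)

model %>% fit_generator(
  generator = gen,
  steps_per_epoch = 500,   # placeholder: number of 128-sample batches per epoch
  epochs = 10
)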