Is there a way to stop readr::read_tsv_chunked() after a certain number of chunks?
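I am trying to use read_tsv_chunked() on a large .tsv file and would like to stop after a certain number of chunks.
@jimhester suggested a useful way to inspect a given chunk interactively with browser(): https://github.com/tidyverse/readr/issues/848#issuecomment-388234659, but I would like to write a function that 1) returns only the chunk of interest, and 2) stops reading the file after returning that chunk.
I have adapted Jim's answer to return the chunk so that it works with DataFrameCallback, but I cannot figure out how to stop read_tsv_chunked() from reading the rest of the file.
My current approach: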
library(readr)

get_problem_chunk <- function(num) {
  i <- 1  # index of the chunk currently being processed
  function(x, pos) {
    if (i == num) {
      i <<- i + 1
      return(x)
    }
    i <<- i + 1
    message(pos) # to see that it's scanning the whole file
    return(NULL) # break() or error() cause errors
  }
}
write_tsv(mtcars, "mtcars.tsv")
read_tsv_chunked("mtcars.tsv", DataFrameCallback$new(get_problem_chunk(3)), chunk_size = 3)
As you can see, it returns the chunk I want, but it does not stop reading until the callback has been handed every remaining chunk:
> read_tsv_chunked("mtcars.tsv", DataFrameCallback$new(get_problem_chunk(3)), chunk_size = 3)
Parsed with column specification:
cols(
mpg = col_double(),
cyl = col_integer(),
disp = col_integer(),
hp = col_integer(),
drat = col_double(),
wt = col_double(),
qsec = col_double(),
vs = col_integer(),
am = col_integer(),
gear = col_integer(),
carb = col_integer()
)
1
4
<I WANT IT TO STOP HERE, BUT DON'T KNOW HOW>
10
13
16
19
22
25
28
31
# A tibble: 3 x 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <int> <int> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int>
1 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
2 24.4 4 NA 62 3.69 3.19 20 1 0 4 2
3 22.8 4 NA 95 3.92 3.15 22.9 1 0 4 2
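Since read_tsv_chunked() in the readr package does not appear to offer a way to stop reading, I thought the more basic read_tsv() function might do, because it can skip lines and stop after reading n rows: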
require(readr)
write.table(mtcars, "mtcars.tsv", sep = "\t", quote = FALSE)
read_tsv_chunk <- function(fpath, start.row, end.row, ...) {
  # Like read_tsv(), but reads only rows start.row to end.row
  # For the column names, read one line:
  df.1 <- suppressWarnings(read_tsv(fpath, skip = 0, n_max = 1))
  # Then read again, from start.row to end.row, both included
  skip.row <- start.row - 1
  df <- suppressWarnings(read_tsv(fpath, skip = skip.row, n_max = end.row - skip.row, ...))
  colnames(df) <- colnames(df.1)
  df
}
Now:
read_tsv_chunk("mtcars.tsv", 7, 9)
# read "mtcars.tsv" from the 7th to the 9th row (both included)
gives:
## Parsed with column specification:
## cols(
## mpg = col_character(),
## cyl = col_integer(),
## disp = col_integer(),
## hp = col_integer(),
## drat = col_integer(),
## wt = col_double(),
## qsec = col_double(),
## vs = col_double(),
## am = col_integer(),
## gear = col_integer(),
## carb = col_integer()
## )
## Parsed with column specification:
## cols(
## Valiant = col_character(),
## `18.1` = col_double(),
## `6` = col_integer(),
## `225` = col_double(),
## `105` = col_integer(),
## `2.76` = col_double(),
## `3.46` = col_double(),
## `20.22` = col_double(),
## `1` = col_integer(),
## `0` = col_integer(),
## `3` = col_integer(),
## `1_1` = col_integer()
## )
## # A tibble: 3 x 12
## mpg cyl disp hp drat wt qsec vs am gear carb `NA`
## <chr> <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int>
## 1 Duster 360 14.3 8 360. 245 3.21 3.57 15.8 0 0 3 4
## 2 Merc 240D 24.4 4 147. 62 3.69 3.19 20.0 1 0 4 2
## 3 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
"Parsed with column specification" appears twice, which is not nice, and the second time it shows the wrong column names...
Actually, you can also do this:
df <- read_tsv_chunked("mtcars.tsv", chunk_size = 3, skip = 6, col_names = TRUE, guess_max = 3)
df
## # A tibble: 3 x 12
## mpg cyl disp hp drat wt qsec vs am gear carb `NA`
## <chr> <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int>
## 1 Duster 360 14.3 8 360. 245 3.21 3.57 15.8 0 0 3 4
## 2 Merc 240D 24.4 4 147. 62 3.69 3.19 20.0 1 0 4 2
## 3 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
But that is no better than the function I wrote...
(for example, df$disp is not recognized correctly)
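One way to avoid the wrong column names in the read_tsv() attempts above is to read the header on its own and pass it back through col_names. The following is only a sketch: the helper name read_tsv_rows and its arguments are hypothetical, and it assumes the file was written without row names (for example with write_tsv()), so that the header and data lines have the same number of fields.

```r
library(readr)

# Hypothetical helper (not part of readr): read data rows
# start_row..end_row (1-based, both included) and keep the header names.
read_tsv_rows <- function(file, start_row, end_row, ...) {
  # Read one row only to obtain the column names.
  header <- names(read_tsv(file, n_max = 1, col_types = cols()))
  # Skip the header line plus the (start_row - 1) data rows before the range.
  read_tsv(file,
           col_names = header,
           skip = start_row,
           n_max = end_row - start_row + 1,
           ...)
}

# Usage on a temporary copy of mtcars (assumption: no row names in the file).
tmp <- tempfile(fileext = ".tsv")
write_tsv(mtcars, tmp)
rows <- read_tsv_rows(tmp, 7, 9, col_types = cols())
```

Because the column types are guessed from the rows actually read, a column such as disp is parsed from the values in that range rather than from the top of the file.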
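If you use read.table, a very basic R function for reading tables (read.csv, read.delim and their variants are wrapper functions around read.table), there is an nrows = argument that determines after how many rows reading stops, and a skip = argument that determines how many lines at the start are skipped.
read.table("mtcars.tsv", header = TRUE, sep = "\t", quote = "\"",
           dec = ".", fill = TRUE, comment.char = "#", nrows = 3, skip = 2 * 3)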
returns what you want:
X18.1 X6 X225 X105 X2.76 X3.46 X20.22 X1 X0 X3 X1.1
1 14.3 8 360.0 245 3.21 3.57 15.84 0 0 3 4
2 24.4 4 146.7 62 3.69 3.19 20.00 1 0 4 2
3 22.8 4 140.8 95 3.92 3.15 22.90 1 0 4 2
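The output above loses the real column names because the header line is skipped along with the data. A possible way around this with base R only is sketched below; the function name read_chunk_base is hypothetical, and the sketch assumes a tab-separated file with a single header line and no row names.

```r
# Hypothetical helper: read chunk number `chunk` (1-based) of `chunk_size`
# rows using base R only, keeping the original column names.
read_chunk_base <- function(file, chunk, chunk_size) {
  # Read the header plus one row just to get the column names.
  header <- names(read.delim(file, nrows = 1))
  # Skip the header line and all rows belonging to earlier chunks.
  read.delim(file,
             header = FALSE,
             col.names = header,
             skip = 1 + (chunk - 1) * chunk_size,
             nrows = chunk_size)
}

# Usage on a temporary copy of mtcars written without row names.
tmp <- tempfile(fileext = ".tsv")
write.table(mtcars, tmp, sep = "\t", quote = FALSE, row.names = FALSE)
chunk <- read_chunk_base(tmp, chunk = 3, chunk_size = 3)
```

Reading stops after nrows rows, so the rest of the file is not parsed, although the skipped lines still have to be scanned.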
@jimhester to the rescue again: https://github.com/tidyverse/readr/issues/851#issuecomment-388929640
You can do this by using the SideEffectCallback (which is the default
when passed a normal function) and returning the results using the <<-
operator. The SideEffectCallback stops reading when the callback
function returns FALSE. e.g.
library(readr)
get_problem_chunk <- function(num) {
  i <- 1
  function(x, pos) {
    if (i == num) {
      res <<- x
      return(FALSE)
    }
    i <<- i + 1
  }
}
write_tsv(mtcars, "mtcars.tsv")
read_tsv_chunked("mtcars.tsv", get_problem_chunk(3), chunk_size = 2, col_types = cols())
#> NULL
res
#> # A tibble: 2 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 2 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
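Building on this answer, the same idea can be wrapped so that the chunk is returned directly instead of being assigned in the global environment. This is a minimal sketch, not part of readr: the name read_nth_chunk and its arguments are hypothetical, and it relies on the behaviour described above, namely that SideEffectCallback stops reading once the callback returns FALSE.

```r
library(readr)

# Hypothetical wrapper: return only the n-th chunk and stop reading there.
read_nth_chunk <- function(file, n, chunk_size, ...) {
  result <- NULL  # filled in by the callback
  i <- 0          # number of chunks seen so far
  callback <- function(x, pos) {
    i <<- i + 1
    if (i == n) {
      result <<- x   # assigns in read_nth_chunk's environment, not globally
      return(FALSE)  # FALSE asks SideEffectCallback to stop
    }
    TRUE
  }
  read_tsv_chunked(file, SideEffectCallback$new(callback),
                   chunk_size = chunk_size, ...)
  result
}

# Usage on a temporary copy of mtcars.
tmp <- tempfile(fileext = ".tsv")
write_tsv(mtcars, tmp)
chunk <- read_nth_chunk(tmp, n = 3, chunk_size = 2, col_types = cols())
```

If the file has fewer than n chunks, the callback never stores anything and the function returns NULL.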