Reading data from a zip file in R corrupts previously read data
My goal is to read zip files directly from the web (opentransportdata.swiss). Each zip file contains several .txt files. In my example, I am trying to retrieve the data from routes.txt.
My code is as follows:
library(tidyverse)
# links
tt_url <- c("https://opentransportdata.swiss/de/dataset/7787e566-03cf-4cd5-8a66-b5af08547e74/resource/4bc9d75e-cdd7-4020-8ee1-9dd494ee8b4c/download/gtfsfp20162016-11-30.zip",
"https://opentransportdata.swiss/de/dataset/587ecf41-eb18-448a-8073-7076bc3cbfeb/resource/e499a630-4e65-4e00-8522-26c5c78b88ca/download/gtfsfp20172017-12-06.zip")
# download zip files
f_get_data <- function(i, data){
url <- tt_url[i]
zip_file <- tempfile(fileext = ".zip")
download.file(url, zip_file, mode = "wb")
df <- read_delim(unzip(zip_file, files = data), delim = ",") %>%
mutate(year = i + 2015)
return(df)
}
test_1 <- f_get_data(1, "routes.txt")
head(test_1)
test_2 <- f_get_data(2, "routes.txt")
head(test_2)
head(test_1)
When I apply the function for the first time with f_get_data(1, "routes.txt"), the resulting data frame test_1 is correct.
head(test_1)
# A tibble: 6 × 8
route_id agency_id route_short_name route_long_name route_type route_color route_text_color year
<chr> <lgl> <chr> <lgl> <dbl> <lgl> <lgl> <dbl>
1 11-21-j16-1 NA 021 NA 3 NA NA 2016
2 11-22-j16-1 NA 022 NA 3 NA NA 2016
3 16-22-j16-1 NA 022 NA 3 NA NA 2016
4 11-25-j16-1 NA 025 NA 3 NA NA 2016
5 11-41-j16-1 NA 041 NA 3 NA NA 2016
6 11-42-j16-1 NA 042 NA 3 NA NA 2016
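If I then move on to the next period with f_get_data(2, "routes.txt"), the resulting data frame test_2 is also correct.
However, once this second call has finished, the first data frame test_1 is corrupted: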
> head(test_2)
# A tibble: 6 × 7
route_id agency_id route_short_name route_long_name route_desc route_type year
<chr> <chr> <chr> <lgl> <chr> <dbl> <dbl>
1 79-0-j17-1 881 00 NA Bus 700 2017
2 11-61-j17-1 7031 061 NA Bus 700 2017
3 11-62-j17-1 7031 062 NA Bus 700 2017
4 24-64-j17-1 801 064 NA Bus 700 2017
5 24-65-j17-1 801 065 NA Bus 700 2017
6 24-66-j17-1 801 066 NA Bus 700 2017
> head(test_1)
# A tibble: 6 × 8
route_id agency_id route_short_name route_long_name route_type route_color route_text_color year
<chr> <lgl> <chr> <lgl> <dbl> <lgl> <lgl> <dbl>
1 ",\"00\",\"\",\"Bus" NA "00\"\r\n" NA NA NA NA 2016
2 "7031\",\"061\",\"" NA "us\",\"" NA NA NA NA 2016
3 "7-1\",\"7031\",\"" NA ",\"\",\"" NA NA NA NA 2016
4 "-64-j17-1\",\"8" NA "064" NA NA NA NA 2016
5 "\r\n\"24-65-j17-" NA "801\"," NA NA NA NA 2016
6 "700\r\n24-66" NA "-1\",\"" NA NA NA NA 2016
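Does anyone know why this happens, and in particular how it happens? As I understand it, once I have assigned the data returned by my function to a data frame, that data frame should be independent of any later calls to the function.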
I can't reproduce this, although I am not using tidyverse-style code. Maybe try my code:
> # links
> tt_url <- c("https://opentransportdata.swiss/de/dataset/7787e566-03cf-4cd5-8a66-b5af08547e74/resource/4bc9d75e-cdd7-4020-8ee1-9dd494ee8b4c/download/gtfsfp20162016-11-30.zip",
+ "https://opentransportdata.swiss/de/dataset/587ecf41-eb18-448a-8073-7076bc3cbfeb/resource/e499a630-4e65-4e00-8522-26c5c78b88ca/download/gtfsfp20172017-12-06.zip")
> # download zip files
> f_get_data <- function(i, data) {
+ on.exit(unlink(temp)) ## don't forget to unlink your tempfiles!
+ temp <- tempfile(fileext='.zip')
+ url <- tt_url[i]
+ download.file(url, temp, mode = "wb")
+ df <- read.csv(unzip(temp, files=data)) |>
+ transform(year=i + 2015)
+ return(df)
+ }
>
> test_1 <- f_get_data(1, "routes.txt")
trying URL 'https://opentransportdata.swiss/de/dataset/7787e566-03cf-4cd5-8a66-b5af08547e74/resource/4bc9d75e-cdd7-4020-8ee1-9dd494ee8b4c/download/gtfsfp20162016-11-30.zip'
downloaded 26.1 MB
> head(test_1)
route_id agency_id route_short_name route_long_name route_type route_color route_text_color year
1 11-21-j16-1 NA 021 NA 3 NA NA 2016
2 11-22-j16-1 NA 022 NA 3 NA NA 2016
3 16-22-j16-1 NA 022 NA 3 NA NA 2016
4 11-25-j16-1 NA 025 NA 3 NA NA 2016
5 11-41-j16-1 NA 041 NA 3 NA NA 2016
6 11-42-j16-1 NA 042 NA 3 NA NA 2016
> test_2 <- f_get_data(2, "routes.txt")
trying URL 'https://opentransportdata.swiss/de/dataset/587ecf41-eb18-448a-8073-7076bc3cbfeb/resource/e499a630-4e65-4e00-8522-26c5c78b88ca/download/gtfsfp20172017-12-06.zip'
downloaded 81.1 MB
> head(test_2)
route_id agency_id route_short_name route_long_name route_desc route_type year
1 79-0-j17-1 881 00 NA Bus 700 2017
2 11-61-j17-1 7031 061 NA Bus 700 2017
3 11-62-j17-1 7031 062 NA Bus 700 2017
4 24-64-j17-1 801 064 NA Bus 700 2017
5 24-65-j17-1 801 065 NA Bus 700 2017
6 24-66-j17-1 801 066 NA Bus 700 2017
> head(test_1)
route_id agency_id route_short_name route_long_name route_type route_color route_text_color year
1 11-21-j16-1 NA 021 NA 3 NA NA 2016
2 11-22-j16-1 NA 022 NA 3 NA NA 2016
3 16-22-j16-1 NA 022 NA 3 NA NA 2016
4 11-25-j16-1 NA 025 NA 3 NA NA 2016
5 11-41-j16-1 NA 041 NA 3 NA NA 2016
6 11-42-j16-1 NA 042 NA 3 NA NA 2016
>
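As a variation on the base R approach above, here is a minimal sketch that reads the member straight out of the archive through a `unz()` connection, so nothing is extracted into the working directory. The helper name `read_zip_member` and its `year` argument are hypothetical and introduced only for illustration; `unz()` and `read.csv()` are standard base R.

```r
# Hypothetical helper: read one file from a zip archive without extracting it.
read_zip_member <- function(zip_file, member, year) {
  # unz() creates a connection to a single file inside the archive.
  # read.csv() opens it, reads all rows into memory, and closes it again,
  # so the returned data frame does not depend on any file on disk.
  df <- read.csv(unz(zip_file, member), stringsAsFactors = FALSE)
  df$year <- year
  df
}

# Example use (assumes `zip_file` is a zip that was already downloaded):
# routes_2016 <- read_zip_member(zip_file, "routes.txt", year = 2016)
```

Because base R reads the whole file eagerly, later calls cannot change a data frame that has already been returned.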
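The problem is the default behavior of the read_delim() function. To improve performance, the data is loaded lazily, which means it is only accessed when it is actually needed.
So the value returned from f_get_data is effectively just a pointer to the data on disk. Here it points to the file extracted from your zip (unzip() writes it to the same path on every call), which is overwritten each time the function is called.
To fix this, set lazy = FALSE in the read_delim() call: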
df <- read_delim(unzip(zip_file, files = data), delim = ",", lazy=FALSE) %>%
mutate(year = i + 2015)
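To show the fix in context, here is a minimal sketch of a reader that combines `lazy = FALSE` with a separate extraction directory for each call, so two archives never write their `routes.txt` to the same path. The helper name `read_zip_member_eager`, its `year` argument, and the per-call directory are assumptions added for illustration and are not part of the original code; it also assumes readr 2.0 or later, where the `lazy` and `show_col_types` arguments exist.

```r
# Hypothetical helper: extract one member into a private directory and read it eagerly.
read_zip_member_eager <- function(zip_file, member, year) {
  # A fresh directory per call, so files extracted from different
  # archives never share a path.
  out_dir <- tempfile("unzipped_")
  dir.create(out_dir)
  on.exit(unlink(out_dir, recursive = TRUE), add = TRUE)

  # unzip() returns the path(s) of the extracted file(s).
  path <- unzip(zip_file, files = member, exdir = out_dir)

  # lazy = FALSE reads the whole file now, so the result no longer
  # depends on `path` after the function returns.
  df <- readr::read_delim(path, delim = ",", lazy = FALSE, show_col_types = FALSE)
  df$year <- year
  df
}

# Example use (assumes `zip_file` is a zip that was already downloaded):
# test_1 <- read_zip_member_eager(zip_file, "routes.txt", year = 2016)
```

Either measure alone should avoid the corruption described in the question. Using both also keeps the working directory free of leftover extracted files.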