Downloading and extracting a .gz data file using R
I have tried to solve my problem by adapting this similar question.
However, for the URL and file I want to do this with, I get the following error.
trying URL 'http://cbio.mskcc.org/microrna_data/human_predictions_S_C_aug2010.txt.gz'
Content type 'application/x-gzip' length 65933953 bytes (62.9 Mb)
opened URL
downloaded 62.9 Mb
Error in open.connection(file, "rt") : cannot open the connection
In addition: Warning message:
In open.connection(file, "rt") :
cannot open zip file 'D:....'
Here is what I tried:
url_S_C <- "http://cbio.mskcc.org/microrna_data/human_predictions_S_C_aug2010.txt.gz"
tmpFile <- tempfile()
fileName <- gsub(".gz","",basename(url_S_C))
download.file(url_S_C, tmpFile)
data <- read.table(unz(tmpFile, fileName))
unlink(tmpFile)
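Maybe someone here can help me understand why this does not work for this particular file?
Note that the file is fairly large (62.9 Mb), but I cannot reproduce the error with the URL from the similar question.
Thanks!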
You can read the data from the file into R as follows (tested on Windows):
library(stringr)
library(plyr)
library(dplyr)
# download and extract file from web
temp <- tempfile()
download.file("http://cbio.mskcc.org/microrna_data/human_predictions_S_C_aug2010.txt.gz", temp)
# read through a gzfile() connection; read.csv() opens and closes it itself
data <- read.csv(gzfile(temp),
                 stringsAsFactors = FALSE,
                 nrows = 20)
unlink(temp)
# column names
my_names <-
  str_split(names(data), "\\.") %>%
  unlist(.)
# toy example using only first 6 rows of dataset
mickey_mouse_data <-
head(data) %>%
unlist(.) %>%
str_split(., "\t") %>%
ldply(.)
names(mickey_mouse_data) <- my_names[-1]
tbl_df(mickey_mouse_data)
mirbase_acc mirna_name gene_id gene_symbol transcript_id ext_transcript_id
1 MIMAT0000062 hsa-let-7a 5270 SERPINE2 uc002vnu.2 NM_006216
2 MIMAT0000062 hsa-let-7a 494188 FBXO47 uc002hrc.2 NM_001008777
3 MIMAT0000062 hsa-let-7a 80025 PANK2 uc002wkc.2 NM_153638
4 MIMAT0000062 hsa-let-7a 26036 ZNF451 uc003pdp.2 AK027074
5 MIMAT0000062 hsa-let-7a 586 BCAT1 uc001rgd.3 NM_005504
6 MIMAT0000062 hsa-let-7a 22903 BTBD3 uc002wnz.2 NM_014962
Variables not shown: mirna_alignment (chr), alignment (chr), gene_alignment (chr),
mirna_start (chr), mirna_end (chr), gene_start (chr), gene_end (chr),
genome_coordinates (chr), conservation (chr), align_score (chr), seed_cat (chr), energy
(chr), mirsvr_score (chr)
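For context on the error in the question: unz() expects a zip archive containing named members, whereas a .gz file is a single gzip-compressed stream, which is what gzfile() reads. Below is a minimal sketch using a small toy file; the file contents and the names tmp, zcon, zip_problem and toy are made up for illustration and are not taken from the dataset.

```r
# Toy stand-in for the downloaded file: a tiny gzip-compressed, tab-separated table.
tmp <- tempfile(fileext = ".txt.gz")
con <- gzfile(tmp, "wt")
writeLines(c("mirbase_acc\tmirna_name\tgene_id",
             "MIMAT0000062\thsa-let-7a\t5270",
             "MIMAT0000062\thsa-let-7a\t494188"), con)
close(con)

# unz() looks for a zip archive, so opening a gzip file through it fails.
zcon <- unz(tmp, "toy.txt")
zip_problem <- tryCatch(
  suppressWarnings({ open(zcon, "rt"); "opened" }),
  error = function(e) conditionMessage(e)
)
close(zcon)

# gzfile() decompresses the stream on the fly; read.delim() assumes tab separators.
toy <- read.delim(gzfile(tmp), stringsAsFactors = FALSE)
unlink(tmp)
```

Here zip_problem holds the error message from the failed unz() attempt, while toy is an ordinary data frame with three columns.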
Some additional options, in base R:
url <- "http://cbio.mskcc.org/microrna_data/human_predictions_S_C_aug2010.txt.gz"
tmp <- tempfile()
##
download.file(url,tmp)
##
data <- read.csv(
gzfile(tmp),
sep="\t",
header=TRUE,
stringsAsFactors=FALSE)
names(data)[1] <- sub("X\\.","",names(data)[1])
##
R> head(data)
mirbase_acc mirna_name gene_id gene_symbol transcript_id ext_transcript_id mirna_alignment
1 MIMAT0000062 hsa-let-7a 5270 SERPINE2 uc002vnu.2 NM_006216 uuGAUAUGUUGGAUGAU-GGAGu
2 MIMAT0000062 hsa-let-7a 494188 FBXO47 uc002hrc.2 NM_001008777 uugaUA-UGUU--GGAUGAUGGAGu
3 MIMAT0000062 hsa-let-7a 80025 PANK2 uc002wkc.2 NM_153638 uugauaUGUUGG-AUGAUGGAgu
4 MIMAT0000062 hsa-let-7a 26036 ZNF451 uc003pdp.2 AK027074 uuGAUAUGUUGGAUGAUGGAGu
5 MIMAT0000062 hsa-let-7a 586 BCAT1 uc001rgd.3 NM_005504 uugaUAUGUUGGAUGAUGGAGu
6 MIMAT0000062 hsa-let-7a 22903 BTBD3 uc002wnz.2 NM_014962 uuGAUAUGUUGGAU-GAUGG-AGu
alignment gene_alignment mirna_start mirna_end gene_start gene_end
1 | :|: ||:|| ||| |||| aaCGGUGAAAUCU-CUAGCCUCu 2 21 495 516
2 || |||: ::||||||||: acaaAUCACAGUUUUUACUACCUUc 2 19 459 483
3 |::||: |||||||| aauuucAUGACUGUACUACCUga 3 17 77 99
4 || || | | ||||||| ccCUCUAGA---UUCUACCUCa 2 21 1282 1300
5 :|| |: |||||||| guagGUAAAGGAAACUACCUCa 2 19 6410 6431
6 || || ||| || ||||| || uaCUUUAAAACAUAUCUACCAUCu 2 21 2265 2288
genome_coordinates conservation align_score seed_cat energy mirsvr_score
1 [hg19:2:224840068-224840089:-] 0.5684 122 0 -14.73 -0.7269
2 [hg19:17:37092945-37092969:-] 0.6464 140 0 -16.38 -0.1156
3 [hg19:20:3904018-3904040:+] 0.6522 139 0 -16.04 -0.2066
4 [hg19:6:56966300-56966318:+] 0.7627 144 7 -14.51 -0.8609
5 [hg19:12:24964511-24964532:-] 0.6775 150 7 -15.09 -0.2735
6 [hg19:20:11906579-11906602:+] 0.5740 131 0 -12.59 -0.2540
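The "X." prefix removed above most likely appears because read.csv() converts column names into syntactic R names by default (check.names = TRUE). The sketch below assumes the header line of the real file starts with "#", which is not confirmed here; the toy text and the names txt, default_names and raw are illustrative only. It shows the default renaming and an alternative that keeps the raw header and strips the leading character explicitly.

```r
# Toy input shaped like the file's header (assumption: the real header starts with "#").
txt <- "#mirbase_acc\tmirna_name\tgene_id\nMIMAT0000062\thsa-let-7a\t5270\n"

# Default behaviour: check.names = TRUE rewrites "#mirbase_acc" into a syntactic name.
default_names <- names(read.delim(text = txt, stringsAsFactors = FALSE))

# Alternative: keep the header untouched, then drop the leading "#" by hand.
raw <- read.delim(text = txt, stringsAsFactors = FALSE, check.names = FALSE)
names(raw)[1] <- sub("^#", "", names(raw)[1])
```

Either route ends with a first column called mirbase_acc; the second one avoids depending on how R mangles the name.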
Alternatively, if you are on a Unix-like system, you can get the .txt file like this (either outside of R, or from inside R using system or system2):
[nathan@nrussell tmp]$ url="http://cbio.mskcc.org/microrna_data/human_predictions_S_C_aug2010.txt.gz"
[nathan@nrussell tmp]$ wget "$url" && gunzip human_predictions_S_C_aug2010.txt.gz
Then proceed as above, reading human_predictions_S_C_aug2010.txt from the directory where wget and gunzip were run,
data <- read.csv(
"~/tmp/human_predictions_S_C_aug2010.txt",
stringsAsFactors=FALSE,
header=TRUE,
sep="\t")
in my case.
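If wget and gunzip are not available (for example on Windows), the decompression step can probably also be done in base R. The following is only a sketch: gunzip_base is a hypothetical helper written for this answer, not a function from R or any package, and it is exercised here on a small toy file rather than the real download.

```r
# Hypothetical helper: write the decompressed contents of a .gz file to `dest`.
gunzip_base <- function(src, dest = sub("\\.gz$", "", src)) {
  in_con <- gzfile(src, "rb")
  on.exit(close(in_con), add = TRUE)
  out_con <- file(dest, "wb")
  on.exit(close(out_con), add = TRUE)
  repeat {
    # readBin() on a gzfile connection returns decompressed bytes
    chunk <- readBin(in_con, what = raw(), n = 1000000L)
    if (length(chunk) == 0) break
    writeBin(chunk, out_con)
  }
  invisible(dest)
}

# Toy usage: compress two lines, decompress them, read the result.
src <- tempfile(fileext = ".txt.gz")
con <- gzfile(src, "wt")
writeLines(c("a\tb", "1\t2"), con)
close(con)

out <- gunzip_base(src)
toy <- read.delim(out, stringsAsFactors = FALSE)
```

Copying in chunks keeps memory use small, which matters more for a file of this size than for the toy example.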