split function does not return any observations with a large dataset
I have a data frame like this:
seqnames pos strand nucleotide count
id1 12 + A 13
id1 13 + C 25
id2 24 + G 10
id2 25 + T 25
id2 26 + A 10
id3 10 + C 5
But it has more than 100,000 rows in total, and seqnames has 3138 levels. I want to split it into a list of data frames by seqnames, so I used the split function:
data_list <- split(data,data$seqnames)
But it just returns something like this:
List of 3138
$ id1:'data.frame': 0 obs. of 6 variables:
..$ seqnames : Factor w/ 3138 levels "id1","id2",..:
..$ pos : int(0)
..$ strand : Factor w/ 3 levels "+","-","*":
..$ nucleotide: Factor w/ 8 levels "A","C","G","T",..:
..$ count : int(0)
..$ sample_id : chr(0)
$ id2:'data.frame': 0 obs. of 6 variables:
..$ seqnames : Factor w/ 3138 levels "id1","id2",..:
..$ pos : int(0)
..$ strand : Factor w/ 3 levels "+","-","*":
..$ nucleotide: Factor w/ 8 levels "A","C","G","T",..:
..$ count : int(0)
..$ sample_id : chr(0)
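I don't understand why this happens, because I have already used split on a made-up data frame containing only numbers (with far fewer rows than this one, of course) and it worked.
How can I solve this?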
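It is just that the 'seqnames' column is a factor, so there are lots of unused levels. split has a drop argument (drop = TRUE; the default is FALSE) that removes those list elements. Otherwise, they are returned as data.frames with 0 rows. If we want those elements to be replaced by NULL instead, find the elements whose number of rows (nrow) is 0 and assign NULL to them: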
data_list <- split(data,data$seqnames)
> str(data_list)
List of 5
$ id1:'data.frame': 2 obs. of 5 variables:
..$ seqnames : Factor w/ 5 levels "id1","id2","id3",..: 1 1
..$ pos : int [1:2] 12 13
..$ strand : chr [1:2] "+" "+"
..$ nucleotide: chr [1:2] "A" "C"
..$ count : int [1:2] 13 25
$ id2:'data.frame': 3 obs. of 5 variables:
..$ seqnames : Factor w/ 5 levels "id1","id2","id3",..: 2 2 2
..$ pos : int [1:3] 24 25 26
..$ strand : chr [1:3] "+" "+" "+"
..$ nucleotide: chr [1:3] "G" "T" "A"
..$ count : int [1:3] 10 25 10
$ id3:'data.frame': 1 obs. of 5 variables:
..$ seqnames : Factor w/ 5 levels "id1","id2","id3",..: 3
..$ pos : int 10
..$ strand : chr "+"
..$ nucleotide: chr "C"
..$ count : int 5
$ id4:'data.frame': 0 obs. of 5 variables:
..$ seqnames : Factor w/ 5 levels "id1","id2","id3",..:
..$ pos : int(0)
..$ strand : chr(0)
..$ nucleotide: chr(0)
..$ count : int(0)
$ id5:'data.frame': 0 obs. of 5 variables:
..$ seqnames : Factor w/ 5 levels "id1","id2","id3",..:
..$ pos : int(0)
..$ strand : chr(0)
..$ nucleotide: chr(0)
..$ count : int(0)
Assigning NULL to the empty elements:
data_list[sapply(data_list, nrow) == 0] <- list(NULL)
Checking again:
> str(data_list)
List of 5
$ id1:'data.frame': 2 obs. of 5 variables:
..$ seqnames : Factor w/ 5 levels "id1","id2","id3",..: 1 1
..$ pos : int [1:2] 12 13
..$ strand : chr [1:2] "+" "+"
..$ nucleotide: chr [1:2] "A" "C"
..$ count : int [1:2] 13 25
$ id2:'data.frame': 3 obs. of 5 variables:
..$ seqnames : Factor w/ 5 levels "id1","id2","id3",..: 2 2 2
..$ pos : int [1:3] 24 25 26
..$ strand : chr [1:3] "+" "+" "+"
..$ nucleotide: chr [1:3] "G" "T" "A"
..$ count : int [1:3] 10 25 10
$ id3:'data.frame': 1 obs. of 5 variables:
..$ seqnames : Factor w/ 5 levels "id1","id2","id3",..: 3
..$ pos : int 10
..$ strand : chr "+"
..$ nucleotide: chr "C"
..$ count : int 5
$ id4: NULL
$ id5: NULL
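As a side note on the assignment above: assigning list(NULL) keeps the empty elements in the list as NULL placeholders, whereas subsetting with the negated index removes them. The sketch below shows both on a small made-up data frame; the names toy, toy_list, with_null and without_empty are hypothetical and not from the question.

```r
# Hypothetical toy data: "id4" and "id5" are levels that never occur
toy <- data.frame(
  seqnames = factor(c("id1", "id1", "id2"),
                    levels = c("id1", "id2", "id4", "id5")),
  count = c(13L, 25L, 10L)
)
toy_list <- split(toy, toy$seqnames)

# TRUE for every list element that has zero rows
empty <- vapply(toy_list, nrow, integer(1)) == 0

# Keep placeholders: the elements stay in the list, but become NULL
with_null <- toy_list
with_null[empty] <- list(NULL)

# Remove them instead: keep only the non-empty elements
without_empty <- toy_list[!empty]

length(with_null)
names(without_empty)
```

Which form is preferable depends on whether later code expects one list element per factor level.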
Data
data <- structure(list(seqnames = structure(c(1L, 1L, 2L, 2L, 2L,
3L), .Label = c("id1",
"id2", "id3", "id4", "id5"), class = "factor"), pos = c(12L,
13L, 24L, 25L, 26L, 10L), strand = c("+", "+", "+", "+", "+",
"+"), nucleotide = c("A", "C", "G", "T", "A", "C"), count = c(13L,
25L, 10L, 25L, 10L, 5L)), row.names = c(NA, -6L), class = "data.frame")
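For completeness, here is a minimal sketch of the drop = TRUE route mentioned above, together with droplevels() as an alternative that removes the unused levels from the factor before splitting. The data frame example_df and the result names are made up for illustration; with the real data, the same calls would be applied to data and data$seqnames.

```r
# Hypothetical example: seqnames has two unused levels ("id4", "id5")
example_df <- data.frame(
  seqnames = factor(c("id1", "id1", "id2", "id2", "id2", "id3"),
                    levels = c("id1", "id2", "id3", "id4", "id5")),
  pos   = c(12L, 13L, 24L, 25L, 26L, 10L),
  count = c(13L, 25L, 10L, 25L, 10L, 5L)
)

# Option 1: let split() discard the groups for levels that do not occur
dropped <- split(example_df, example_df$seqnames, drop = TRUE)

# Option 2: remove the unused levels from the factor first, then split
cleaned <- example_df
cleaned$seqnames <- droplevels(cleaned$seqnames)
relevelled <- split(cleaned, cleaned$seqnames)

names(dropped)
names(relevelled)
```

Both options return three elements here. The difference is that with drop = TRUE the seqnames column inside each piece still carries all five levels, while droplevels() also shrinks the factor itself.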