Reshape a data.frame based on similar values in one column
I have a data.frame with 2 columns, in which the values in the second column are repeated. For example:
HUGO Cell
1 CD28 T cells
2 CD3D T cells
3 CD3G T cells
4 CD8A lymphocytes
5 EOMES lymphocytes
6 FGFBP2 lymphocytes
7 GNLY lymphocytes
8 NCR1 NK cells
9 PTGDR NK cells
10 SH2D1B NK cells
I would like all the values in the HUGO column that correspond to each unique name in the Cell column to be listed after that name.
For example:
T cells: CD28 CD3D CD3G
lymphocytes: CD8A EOMES FGFBP2 GNLY
...
I tried
reshape(data.frame, timevar = "HUGO",idvar = "Cell",direction = "wide")
but it just returns the number of values for each name in the Cell column.
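Here are some possibilities, depending on what you need. The first 5 do not use any packages.
1) aggregate/c This gives a data frame whose second column holds, for each Cell, a character vector of HUGO names.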
aggregate(HUGO ~ Cell, DF, c)
giving:
Cell HUGO
1 lymphocytes CD8A, EOMES, FGFBP2, GNLY
2 NK cells NCR1, PTGDR, SH2D1B
3 T cells CD28, CD3D, CD3G
2) aggregate/toString This gives a data frame whose second column contains, for each Cell, a single string of comma-separated HUGO names.
aggregate(HUGO ~ Cell, DF, toString)
giving:
Cell HUGO
1 lymphocytes CD8A, EOMES, FGFBP2, GNLY
2 NK cells NCR1, PTGDR, SH2D1B
3 T cells CD28, CD3D, CD3G
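The question asks for space-separated names rather than comma-separated ones. As a minimal sketch of that variation (not part of the original answer), aggregate passes extra arguments on to its function, so paste with collapse = " " can stand in for toString. DF is rebuilt here from the values in the Note so the block runs on its own; the name res is only illustrative.

```r
# Input data, rebuilt from the values given in the Note.
DF <- data.frame(
  HUGO = c("CD28", "CD3D", "CD3G", "CD8A", "EOMES",
           "FGFBP2", "GNLY", "NCR1", "PTGDR", "SH2D1B"),
  Cell = rep(c("T cells", "lymphocytes", "NK cells"), times = c(3, 4, 3)),
  stringsAsFactors = FALSE
)

# Extra arguments to aggregate() are forwarded to FUN, so each group's
# HUGO names are pasted into one space-separated string.
res <- aggregate(HUGO ~ Cell, DF, paste, collapse = " ")

# Print one "Cell: names" line per group, as in the question.
writeLines(paste0(res$Cell, ": ", res$HUGO))
```

The row order of res follows the sorted Cell names, which can differ between locales.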
3) unstack This gives a list with one component per Cell, each component being the HUGO names for that Cell.
unstack(DF)
giving:
$lymphocytes
[1] "CD8A" "EOMES" "FGFBP2" "GNLY"
$`NK cells`
[1] "NCR1" "PTGDR" "SH2D1B"
$`T cells`
[1] "CD28" "CD3D" "CD3G"
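A named list like this can also be built with split, which may be easier to read when the data frame has more than two columns. The following is a sketch of that alternative, not part of the original answer; DF is rebuilt from the Note, and by_cell is an illustrative name.

```r
# Input data, rebuilt from the values given in the Note.
DF <- data.frame(
  HUGO = c("CD28", "CD3D", "CD3G", "CD8A", "EOMES",
           "FGFBP2", "GNLY", "NCR1", "PTGDR", "SH2D1B"),
  Cell = rep(c("T cells", "lymphocytes", "NK cells"), times = c(3, 4, 3)),
  stringsAsFactors = FALSE
)

# split() groups the HUGO vector by Cell and returns a named list,
# one character vector per distinct Cell value.
by_cell <- split(DF$HUGO, DF$Cell)

# Components are accessed by name; names with spaces need [[ ]] or backticks.
by_cell[["NK cells"]]

# Number of HUGO names per Cell.
lengths(by_cell)
```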
4) tapply This gives a matrix whose rows are the Cells and whose columns are the sequence numbers of the HUGO names within each Cell.
DF2 <- transform(DF, seq = ave(seq_along(HUGO), Cell, FUN = seq_along))
tapply(DF2$HUGO, DF2[-1], c)
giving:
seq
Cell 1 2 3 4
lymphocytes "CD8A" "EOMES" "FGFBP2" "GNLY"
NK cells "NCR1" "PTGDR" "SH2D1B" NA
T cells "CD28" "CD3D" "CD3G" NA
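The seq column is what makes this and the later alternatives work, so it may help to look at it in isolation. The sketch below rebuilds DF from the Note and computes the within-group counter on its own; the names seq_in_group and seq_alt are illustrative, and the rle-based variant is an extra shown only for comparison.

```r
# Input data, rebuilt from the values given in the Note.
DF <- data.frame(
  HUGO = c("CD28", "CD3D", "CD3G", "CD8A", "EOMES",
           "FGFBP2", "GNLY", "NCR1", "PTGDR", "SH2D1B"),
  Cell = rep(c("T cells", "lymphocytes", "NK cells"), times = c(3, 4, 3)),
  stringsAsFactors = FALSE
)

# ave() splits 1..nrow by Cell and applies seq_along within each group,
# so the counter restarts at 1 for every Cell.
seq_in_group <- ave(seq_along(DF$Cell), DF$Cell, FUN = seq_along)

# Alternative that relies on each Cell's rows being contiguous, as they are here.
seq_alt <- sequence(rle(DF$Cell)$lengths)

# Show the counter next to the data.
print(cbind(DF, seq = seq_in_group))
```

Because the largest group has 4 rows, the wide results have 4 value columns, and the shorter groups are padded with NA.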
5) reshape This uses DF2 from the previous alternative together with reshape to give a data frame:
reshape(DF2, timevar = "seq", idvar = "Cell", dir = "wide")
giving:
Cell HUGO.1 HUGO.2 HUGO.3 HUGO.4
1 T cells CD28 CD3D CD3G <NA>
4 lymphocytes CD8A EOMES FGFBP2 GNLY
8 NK cells NCR1 PTGDR SH2D1B <NA>
6) spread This gives a "tbl_df" class object as output (it is a subclass of "data.frame").
library(dplyr)
library(tidyr)
DF %>%
group_by(Cell) %>%
mutate(seq = 1:n()) %>%
ungroup() %>%
spread(seq, HUGO)
giving:
Cell 1 2 3 4
1 lymphocytes CD8A EOMES FGFBP2 GNLY
2 NK cells NCR1 PTGDR SH2D1B <NA>
3 T cells CD28 CD3D CD3G <NA>
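In tidyr 1.0.0 and later, spread is superseded by pivot_wider. A sketch of the same reshaping with pivot_wider follows, assuming recent versions of dplyr and tidyr are installed. It is a substitute for the spread call, not the original answer's code; DF is rebuilt from the Note and wide is an illustrative name.

```r
library(dplyr)
library(tidyr)

# Input data, rebuilt from the values given in the Note.
DF <- data.frame(
  HUGO = c("CD28", "CD3D", "CD3G", "CD8A", "EOMES",
           "FGFBP2", "GNLY", "NCR1", "PTGDR", "SH2D1B"),
  Cell = rep(c("T cells", "lymphocytes", "NK cells"), times = c(3, 4, 3)),
  stringsAsFactors = FALSE
)

wide <- DF %>%
  group_by(Cell) %>%
  mutate(seq = row_number()) %>%   # counter 1..n within each Cell
  ungroup() %>%
  # One column per seq value, filled with HUGO; missing cells become NA.
  pivot_wider(names_from = seq, values_from = HUGO)

print(wide)
```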
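7) read.zoo read.zoo gives a zoo object whose times are the Cells. Since the times are really character strings, we use FUN = identity to avoid having them interpreted. fortify.zoo converts the result to a data frame. DF2 is from above.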
library(zoo)
fortify.zoo(read.zoo(DF2, split = "seq", index = "Cell", FUN = identity))
giving:
Index 1 2 3 4
1 lymphocytes CD8A EOMES FGFBP2 GNLY
2 NK cells NCR1 PTGDR SH2D1B <NA>
3 T cells CD28 CD3D CD3G <NA>
8) dcast This gives a data.table as output.
library(data.table)
DT <- data.table(DF)
DT[, seq:=1:.N, by = Cell]
dcast(DT, Cell ~ seq, value.var = "HUGO")
giving:
Cell 1 2 3 4
1: NK cells NCR1 PTGDR SH2D1B NA
2: T cells CD28 CD3D CD3G NA
3: lymphocytes CD8A EOMES FGFBP2 GNLY
Note: The input DF used above is:
DF <- structure(list(HUGO = c("CD28", "CD3D", "CD3G", "CD8A", "EOMES",
"FGFBP2", "GNLY", "NCR1", "PTGDR", "SH2D1B"), Cell = c("T cells",
"T cells", "T cells", "lymphocytes", "lymphocytes", "lymphocytes",
"lymphocytes", "NK cells", "NK cells", "NK cells")), .Names = c("HUGO",
"Cell"), class = "data.frame", row.names = c(NA, -10L))