Wrangle dataframe in R, possibly with dcast
I have a very large data.frame that I need to tidy up a bit. The current structure is:
V1 V2 V3 V4 V5 V6 V7 V8 ... Vn Vn+1
chr1 1 A T sample_1 value_1 sample_2 value_4 ... sample_n value_7
chr1 40 T C sample_1 value_2 sample_2 value_5 ... sample_n value_8
chr1 60 A T sample_1 value_3 sample_2 value_6 ... sample_n value_9
.
.
.
chrX 160 A T sample_1 value_x sample_2 value_y ... sample_n value_ni
For example, for this data frame:
df <- structure(list(V1 = c(10L, 10L, 10L, 10L, 10L, 10L), V2 = c(3387501L,
4174142L, 6419754L, 6419765L, 6419897L, 6419912L), V3 = c("T",
"A", "C", "T", "G", "A"), V4 = c("A",
"T", "A", "A", "C", "G"), V5 = c("LP2000748-DNA_H02",
"LP2000748-DNA_H02", "LP2000748-DNA_H02", "LP2000748-DNA_H02",
"LP2000748-DNA_H02", "LP2000748-DNA_H02"), V6 = c("0/0", "0/0",
"1/1", "0/0", "0/0", "0/0"), V7 = c("LP2000748-DNA_A03", "LP2000748-DNA_A03",
"LP2000748-DNA_A03", "LP2000748-DNA_A03", "LP2000748-DNA_A03",
"LP2000748-DNA_A03"), V8 = c("0/0", "0/0", "1/1", "0/1", "0/0",
"0/0"), V9 = c("LP2000795-DNA_B01", "LP2000795-DNA_B01", "LP2000795-DNA_B01",
"LP2000795-DNA_B01", "LP2000795-DNA_B01", "LP2000795-DNA_B01"
), V10 = c("0/0", "0/0", "1/1", "0/0", "0/0", "0/0")), row.names = c(NA,
-6L), class = c("data.table", "data.frame"))
What I want in the end is a table like this:
V1 V2 V3 V4 sample_1 sample_2 ... sample_n
chr1 1 A T value_1 value_4 ... value_7
chr1 40 T C value_2 value_5 ... value_8
chr1 60 A T value_3 value_6 ... value_9
.
.
.
chrX 160 A T value_x value_y ... value_ni
What I have tried so far in R is:
samples_data <- seq(from = 5, to = dim(df)[2],by=2)
variable_data <- samples_data + 1
new_df <- reshape2::dcast(df, V1 + V2 + V3 ~ colnames(df)[samples_data], value.var= colnames(df)[variable_data])
But I get this error message:
recursive indexing failed at level 2
In addition: Warning message:
In if (!(value.var %in% names(data))) { :
the condition has length > 1 and only the first element will be used
Does anyone have a suggestion on how to fix this, or on how to reshape the df?
Thanks!
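A note on the warning itself: it comes from the check `value.var %in% names(data)`, which suggests that `value.var` is expected to be a single column name, while the call above passes a whole vector of names. The sketch below reproduces just that part on a made-up toy frame (`toy` and its values are hypothetical, only the column layout follows the question) using base R only.

```r
# Toy frame with the question's layout (made-up values): four ID columns,
# then alternating sample-name / value columns.
toy <- data.frame(
  V1 = 10L, V2 = c(100L, 200L), V3 = c("T", "A"), V4 = c("A", "T"),
  V5 = "s1", V6 = c("0/0", "1/1"),
  V7 = "s2", V8 = c("0/1", "0/0")
)

# Same index arithmetic as in the attempt above.
samples_data  <- seq(from = 5, to = ncol(toy), by = 2)  # 5, 7
variable_data <- samples_data + 1                       # 6, 8

# This is what ends up in value.var: several names, not one.
value_names <- colnames(toy)[variable_data]
value_names                  # "V6" "V8"
length(value_names)          # 2

# The check named in the warning therefore yields a logical vector,
# one entry per name, instead of a single TRUE/FALSE.
in_names <- value_names %in% names(toy)
in_names                     # TRUE TRUE
```

So the data first has to be brought into a shape with one sample column and one value column before a single `value.var` can be named.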
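You probably need to unnest the data first and then use reshape. To unnest, you can use Map to build a list in which each element holds the first four ID columns plus one pair of the remaining columns, following the pattern 5,6; 7,8; 9,10. Then rbind the result and reshape.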
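# Work on a plain data.frame: if data.table is loaded, df[c(1:4, x:y)] would select rows, not columns
df <- as.data.frame(df)
cseq <- 5:ncol(df)
# For each (sample, value) column pair, keep the four ID columns plus that pair, renamed to sample/value
tmp <- do.call(rbind, Map(function(x, y) setNames(df[c(1:4, x:y)],
                                                   c(names(df)[1:4], c("sample", "value"))),
               cseq[cseq %% 2 != 0], cseq[cseq %% 2 == 0]))
# Long to wide: one value column per distinct sample
res <- reshape(tmp, idvar=1:4, timevar="sample", v.names="value", direction="wide")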
res
# V1 V2 V3 V4 value.LP2000748-DNA_H02 value.LP2000748-DNA_A03 value.LP2000795-DNA_B01
# 1 10 3387501 T A 0/0 0/0 0/0
# 2 10 4174142 A T 0/0 0/0 0/0
# 3 10 6419754 C A 1/1 1/1 1/1
# 4 10 6419765 T A 0/0 0/1 0/0
# 5 10 6419897 G C 0/0 0/0 0/0
# 6 10 6419912 A G 0/0 0/0 0/0
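Since the question mentions dcast, the same unnest-then-widen idea can also be sketched with data.table, assuming that package is installed. `melt` accepts a list of column groups in `measure.vars`, so the sample columns and the value columns can be stacked in parallel, and `dcast` then takes a single `value.var`. The toy table `dt` and the sample names `s1` and `s2` are made up for illustration.

```r
library(data.table)

# Toy data in the question's layout (made-up values): four ID columns,
# then alternating sample-name / value columns.
dt <- data.table(
  V1 = c(10L, 10L, 10L),
  V2 = c(100L, 200L, 300L),
  V3 = c("T", "A", "C"),
  V4 = c("A", "T", "A"),
  V5 = "s1", V6 = c("0/0", "0/0", "1/1"),
  V7 = "s2", V8 = c("0/0", "0/1", "1/1")
)

# Odd positions from 5 hold sample names, even positions hold values.
sample_cols <- names(dt)[seq(5, ncol(dt), by = 2)]  # V5, V7
value_cols  <- names(dt)[seq(6, ncol(dt), by = 2)]  # V6, V8

# Stack the two column groups in parallel into "sample" and "value".
long <- melt(dt,
             id.vars      = names(dt)[1:4],
             measure.vars = list(sample_cols, value_cols),
             value.name   = c("sample", "value"))

# Widen again: one column per sample, filled from the single value column.
wide <- dcast(long, V1 + V2 + V3 + V4 ~ sample, value.var = "value")
wide
```

With this route the new columns are named after the samples directly, without the `value.` prefix that reshape adds.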