Take the average score of duplicated entries and convert to wide format
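I want to reshape my data into wide format, but I want to take the average of the third column for each entry in the fourth column, for example (0.21+0.05+0.06)/total.
I have read about the reshape package in R, but I don't know which aggregation function to use to compute the average before converting to wide format.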
Input data frame
CID100000085 C0000737 0.21 Abdominal pain
CID100000085 C0000737 0.21 Gastrointestinal pain
CID100000085 C0000737 0.05 Abdominal pain
CID100000085 C0000737 0.05 Gastrointestinal pain
CID100000085 C0000737 0.06 Abdominal pain
CID100000085 C0000737 0.06 Gastrointestinal pain
Desired output
Abdominal pain Gastrointestinal pain
CID100000085 C0000737 0.0166 0.0166
You can try aggregate and reshape in base R:
reshape(aggregate(V3~V1+V2+V4, df, mean),
idvar = "V1", timevar = "V4", direction = "wide")[,-4]
# V1 V2.Abdominalpain V3.Abdominalpain V3.Gastrointestinalpain
#1 CID100000085 C0000737 0.1066667 0.1066667
Data
df <- structure(list(V1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "CID100000085", class = "factor"),
V2 = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "C0000737", class = "factor"),
V3 = c(0.21, 0.21, 0.05, 0.05, 0.06, 0.06), V4 = structure(c(1L,
2L, 1L, 2L, 1L, 2L), .Label = c("Abdominalpain", "Gastrointestinalpain"
), class = "factor")), .Names = c("V1", "V2", "V3", "V4"), class = "data.frame", row.names = c(NA,
-6L))
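The `[,-4]` in the call above is needed because only V1 is declared as the id variable, so reshape also spreads V2 across the pain types and one redundant column has to be dropped. A minimal sketch of a variant, assuming V1 and V2 together identify a row: pass both as `idvar` so only V3 is spread. The names `agg` and `wide` are illustrative, and the data are rebuilt with `data.frame` (same values as the dput output) so the block runs on its own.

```r
# Rebuild the example data (same values as the dput output above)
df <- data.frame(
  V1 = "CID100000085",
  V2 = "C0000737",
  V3 = c(0.21, 0.21, 0.05, 0.05, 0.06, 0.06),
  V4 = rep(c("Abdominalpain", "Gastrointestinalpain"), times = 3)
)

# Average V3 within each V1/V2/V4 combination
agg <- aggregate(V3 ~ V1 + V2 + V4, data = df, FUN = mean)

# Treat both ID columns as idvar so only V3 is spread across V4
wide <- reshape(agg, idvar = c("V1", "V2"), timevar = "V4", direction = "wide")
print(wide)
```

With this variant the result should have the columns V1, V2, V3.Abdominalpain and V3.Gastrointestinalpain, without any positional column removal.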
We can use dcast from data.table directly:
library(data.table)
dcast(setDT(df1), id1 + id2 ~ pain, value.var = "value", fun.aggregate = mean)
# id1 id2 Abdominal pain Gastrointestinal pain
#1: CID100000085 C0000737 0.1066667 0.1066667
Data
df1 <- structure(list(id1 = c("CID100000085", "CID100000085", "CID100000085",
"CID100000085", "CID100000085", "CID100000085"), id2 = c("C0000737",
"C0000737", "C0000737", "C0000737", "C0000737", "C0000737"),
value = c(0.21, 0.21, 0.05, 0.05, 0.06, 0.06), pain = c("Abdominal pain",
"Gastrointestinal pain", "Abdominal pain", "Gastrointestinal pain",
"Abdominal pain", "Gastrointestinal pain")),
.Names = c("id1",
"id2", "value", "pain"), class = "data.frame", row.names = c(NA,
-6L))
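As a quick cross-check of the averages, here is a small base R sketch (not part of the original answers) that computes the per-category mean directly with tapply. The name `means` is illustrative and the data are rebuilt inline. Each category averages (0.21 + 0.05 + 0.06) / 3, which is about 0.1067; this matches the 0.1066667 shown in the answers rather than the 0.0166 written in the question's desired output.

```r
# Rebuild the example data (same values as df1 above)
df1 <- data.frame(
  id1 = "CID100000085",
  id2 = "C0000737",
  value = c(0.21, 0.21, 0.05, 0.05, 0.06, 0.06),
  pain = rep(c("Abdominal pain", "Gastrointestinal pain"), times = 3)
)

# Mean of value within each pain category
means <- tapply(df1$value, df1$pain, mean)
print(means)
```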