Transform rows to columns based on certain criteria
I have a dataset of attributes and their corresponding values, as shown below:
Obs# Id Class Date MedicationName Dose BloodTestResult
1 1433 1 2007/01/01 Sitaglyptin 100mg 6.2
2 1433 1 2007/03/24 Sitaglyptin 100mg 6.4
3 1433 1 2007/06/15 Sitaglyptin 100mg 6.5
4 1433 2 2007/09/25 Glucophage 10mg 6.7
5 1433 2 2007/12/30 Glucophage 10mg 6.5
6 1433 2 2008/02/01 Glucophage 10mg 6.6
7 1433 3 2008/05/03 Glumetza 10mg 7.2
8 1433 3 2008/08/10 Glumetza 10mg 6.4
9 1433 3 2008/11/14 Glumetza 20mg 6.7
10 1433 3 2009/02/02 Glumetza 20mg 6.5
11 8348 3 2007/04/11 Glumetza 20mg 6.5
12 8348 3 2007/07/15 Glumetza 20mg 6.6
I would like to transform it into a dataset like this:
Obs# Id Class Date1 MedicationName1 Dose1 Date2 MedicationName2 Dose2 Date3 MedicationName3 Dose3 BloodTestResult
1 1433 1 2007/01/01 Sitaglyptin 100mg 2007/03/24 Sitaglyptin 100mg 2007/06/15 Sitaglyptin 100mg 6.7
2 1433 2 2007/09/25 Glucophage 10mg 2007/12/30 Glucophage 10mg 2008/02/01 Glucophage 10mg 7.2
3 1433 3 2008/05/03 Glumetza 10mg 2008/08/10 Glumetza 10mg - - - 6.7
4 1433 3 2008/11/14 Glumetza 20mg 2009/02/02 Glumetza 20mg - - - 6.5
5 8348 3 2007/04/11 Glumetza 20mg 2007/07/15 Glumetza 20mg - - - 6.6
The dataset above is transformed from rows to columns whenever any one of these criteria applies.
Scenario 1) Change in medication (MedicationName) or change in dose (Dose)
Observations 1, 2, 3 have the same medication (Sitaglyptin) and the same dose (100mg),
so these three rows are transformed into one row (row 1) as shown in the
transformed dataset. The last column, BloodTestResult, contains the value
from the 4th row (6.7).
Similarly, rows 4, 5, 6 are grouped because of the medication change (Glucophage).
These three rows are transformed into a single row 2 as shown in the new
dataset. The last column, BloodTestResult, contains the value from the 7th row (7.2).
Similarly, rows 7 and 8 are grouped because of the medication change (Glumetza).
These two rows are transformed into a single row 3 as shown in the new
dataset. The last column, BloodTestResult, contains the value from the 9th row (6.7).
Scenario 2) Change in medication (MedicationName) or change in dose (Dose)
Rows 9 and 10 are transformed into a single row 4 as shown in the new dataset
because of the dosage change (20mg). The last column, BloodTestResult, contains
the value from the 10th row (6.5) and not the 11th row, because this is the last
medication/dosage change for Id 1433.
Scenario 3) Last medication recorded for that patient Id
Rows 11 and 12 represent the only (or last) available information for
Id 8348, so they are simply transformed into a single row 5 as shown in the
transformed dataset. The last column, BloodTestResult, contains the value
from the 12th row (6.6), because this is the last medication/dosage change
for Id 8348.
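I apologize if this is confusing; I hope I have explained the pattern for transforming this dataset clearly. Any help with transforming the dataset according to these requirements is appreciated.
Data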
df <- structure(list(Obs = 1:12, Id = c(1433L, 1433L, 1433L, 1433L,
1433L, 1433L, 1433L, 1433L, 1433L, 1433L, 8348L, 8348L), Class = c(1L,
1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), Date = structure(c(1L,
2L, 4L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 3L, 5L), .Label = c("2007/01/01",
"2007/03/24", "2007/04/11", "2007/06/15", "2007/07/15", "2007/09/25",
"2007/12/30", "2008/02/01", "2008/05/03", "2008/08/10", "2008/11/14",
"2009/02/02"), class = "factor"), MedicationName = structure(c(3L,
3L, 3L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Glucophage",
"Glumetza", "Sitaglyptin"), class = "factor"), Dose = structure(c(1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L), .Label = c("100mg",
"10mg", "20mg"), class = "factor"), BloodTestResult = c(6.2,
6.4, 6.5, 6.7, 6.5, 6.6, 7.2, 6.4, 6.7, 6.5, 6.5, 6.6)), .Names = c("Obs",
"Id", "Class", "Date", "MedicationName", "Dose", "BloodTestResult"
), class = "data.frame", row.names = c(NA, -12L))
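This is a tricky data transformation, especially BloodTestResult, because it needs data from outside the initial grouping by Id, Class (or MedicationName) and Dose. Breaking it into steps, you could try the following (I call the data dat):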
## The question's data frame is named df; it is called dat here
dat <- df
## First split data: Id, Class and Dose
groups <- split(dat, interaction(dat$Id, dat$Class, dat$Dose, drop = TRUE))
## Then, for each grouping, split by rows the columns you want to expand
tmp <- lapply(groups, function(x)
  cbind(x[1, 1:3], do.call(cbind, split(x[, -c(1:3, ncol(x))], 1:nrow(x)))))
## Put back into data.frame
library(plyr)  # for rbind.fill, since some data.frames are missing columns
res <- do.call(rbind.fill, tmp)
## Finally, add the bloodtest: within each Id, take the value at every row
## where Dose or Class changes, then append that Id's last value.
## Note: this assumes the rows of res are in the same order as the change
## points found here, which holds for this data.
res$BloodTestResult <- unlist(sapply(split(dat, dat$Id), function(x)
  c(x$BloodTestResult[c(FALSE, !(tail(x$Dose, -1) == head(x$Dose, -1) &
                                   tail(x$Class, -1) == head(x$Class, -1)))],
    tail(x$BloodTestResult, 1))))
res
# Obs Id Class 1.Date 1.MedicationName 1.Dose 2.Date 2.MedicationName
# 1 1 1433 1 2007/01/01 Sitaglyptin 100mg 2007/03/24 Sitaglyptin
# 2 4 1433 2 2007/09/25 Glucophage 10mg 2007/12/30 Glucophage
# 3 7 1433 3 2008/05/03 Glumetza 10mg 2008/08/10 Glumetza
# 4 9 1433 3 2008/11/14 Glumetza 20mg 2009/02/02 Glumetza
# 5 11 8348 3 2007/04/11 Glumetza 20mg 2007/07/15 Glumetza
# 2.Dose 3.Date 3.MedicationName 3.Dose BloodTestResult
# 1 100mg 2007/06/15 Sitaglyptin 100mg 6.7
# 2 10mg 2008/02/01 Glucophage 10mg 7.2
# 3 10mg <NA> <NA> <NA> 6.7
# 4 20mg <NA> <NA> <NA> 6.5
# 5 20mg <NA> <NA> <NA> 6.6
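The BloodTestResult column is computed by first splitting the data by Id, then looking for changes in Dose or Class and extracting BloodTestResult at those positions, and finally appending the last BloodTestResult for each Id.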
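To make the change-detection step easier to follow, here is a minimal sketch of the same idea on plain vectors for a single Id. The vector names (dose, class, blood, changed, result) are made up for illustration, and the values are copied from Id 1433 in the question; this is not a replacement for the code above, only a way to see what the tail()/head() comparison does.

```r
# Toy vectors for one Id (values copied from Id 1433 in the question)
dose  <- c("100mg", "100mg", "100mg", "10mg", "10mg", "10mg",
           "10mg", "10mg", "20mg", "20mg")
class <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3)
blood <- c(6.2, 6.4, 6.5, 6.7, 6.5, 6.6, 7.2, 6.4, 6.7, 6.5)

# Compare each row with the previous one: tail(x, -1) drops the first
# element, head(x, -1) drops the last, so the two are aligned pairwise.
# The leading FALSE stands for the first row, which has no predecessor.
changed <- c(FALSE, tail(dose, -1) != head(dose, -1) |
                    tail(class, -1) != head(class, -1))

which(changed)  # rows where a new medication/dose block starts: 4 7 9

# Blood test at each change point, plus the last value for this Id
result <- c(blood[changed], tail(blood, 1))
result          # 6.7 7.2 6.7 6.5
```

The first three values come from the first row of each following block, and the final value is the Id's last measurement, which matches rows 1 to 4 of the output above.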