How can I calculate the mean of variable importance of elements in a list?
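I am training a random forest three times and saving the variable importance to a list (using the caret package). How can I calculate the mean of each feature, if it exists?
For example, how can I calculate the mean of the three "Overall" values of "ESR"? (I am going to train this algorithm a thousand times.)
These are my examples: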
[[1]]
rf variable importance
only 20 most important variables shown (out of 119)
Overall
Albumin 100.00
age 97.36
PR 60.18
RR 42.41
Weight 35.26
SystolicBP 32.14
Cancers1 29.79
ESR 27.66
Neutrophyl 26.98
CPK 25.68
EjectionFraction 25.59
BMI 24.42
Calcium 23.87
WBC 22.36
Urea 22.01
LDH 21.23
FBS 20.21
Ddimer 19.32
HB 18.99
Lymphocyte 18.78
[[2]]
rf variable importance
only 20 most important variables shown (out of 119)
Overall
age 100.00
FBS 57.80
WBC 53.88
PR 53.84
Neutrophyl 53.52
Weight 52.31
HB 51.69
LDH 50.15
Urea 49.31
Albumin 47.05
Lymphocyte 46.87
CPK 46.54
SystolicBP 45.64
Calcium 44.87
ESR 43.54
Ferritin 43.03
CRP 43.00
PLT 42.83
Creatinine 42.53
EjectionFraction 41.43
[[3]]
rf variable importance
only 20 most important variables shown (out of 119)
Overall
age 100.00
Albumin 43.41
Weight 24.88
FBS 24.63
BS 23.31
PR 21.47
LDH 21.06
Neutrophyl 20.68
BMI 17.94
EjectionFraction 17.29
CPK 16.49
WBC 16.11
ALP 15.72
RR 15.28
Lymphocyte 14.94
Cancers1 14.68
CRP 14.50
ESR 14.38
Ddimer 13.05
Ferritin 12.96
Can I create a data frame that holds the features and their Overall values?
Thank you for your help.
This is my code:
prediction_value_rf=list()
importance_rf=list()
auc_rf=list()
weight_rf=list()
for ( i in 1:1000){
resample_death <- death[sample(nrow(death), size=300), ]
resample_alive <-alive[sample(nrow(alive), size=300), ]
f_dataset=rbind(resample_alive,resample_death)
inx <- sample.split(seq_len(nrow(f_dataset)), 0.25)
trainData<- f_dataset[!inx, ]
testData <- f_dataset[inx, ]
rf_fit <- train(vital_status ~ .,
data = trainData,
method = "rf",
)
pred=predict(rf_fit, testData[,-109])
pred1=predict(rf_fit, testData[,-109],type='prob')
prediction_value_rf[[i]]=pred1[2]
auc=auc(testData$vital_status,as.numeric(pred1[[2]]),direction="<", levels = levels(testData$vital_status))
auc_rf[[i]]=auc
a=varImp(rf_fit,scale = TRUE)
importance_rf[[i]] <- a
weight_rf[[i]]=max(rf_fit$results$Accuracy)
}
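In the end, I want to calculate the mean of the Overall value of every feature (I want to create an ensemble model).
My dataset contains 109 features and 4200 samples.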
> dput(importance_rf)
list(structure(list(importance = structure(list(Overall = c(100,
32.9191368970689, 0, 29.4889011862606, 24.8664587940577, 21.8746288172869,
21.7051171149606, 20.0868919191658, 20.3678665772965, 20.2873319598582,
33.7597621482843, 42.1891066454062, 22.7027798691687, 17.0766042463516,
39.4559095867264, 17.9431725056776, 23.2881573588367, 5.04721532342669,
22.3290849893345, 20.7266835722104, 21.5723519894789, 19.5211504808207,
21.2794742178794, 20.1624361665348, 13.7420140365184, 31.7941409073075,
20.9409991203303, 30.4229311296897, 11.5187371425859, 12.8487688047673,
9.40749461290917, 10.361793419014, 32.5677389075859, 26.5411449178312,
23.3996095888034, 2.84823906954271, 10.0257295515002, 2.27406632480383,
0.221285401034356, 0.844517489791465, 1.97286969198767, 0.0909347758420391,
0.541007254389242, 0.359718315763083, 1.26912866459011, 0.158954429130366,
0.245159217854806, 1.43768928047267, 0.796627703857018, 0.0731764363395144,
1.72357935713514, 0.424562470997031, 3.38312715168264, 1.88770244332681,
0.0314985706869475, 0, 0.65427952713802, 0, 0.0171557103229226,
0.709743254593806, 1.13539938842206, 0.0367104133426984, 2.95211595985093,
0, 0.582868854914444, 0.393813676879418, 1.15732422255054, 2.24940561099934,
1.73472209382337, 1.34428847541862, 1.15486784386305, 0, 0.689216959226089,
0.625678629482648, 1.81161997423301, 0.433030827900777, 10.9106578268112,
2.24295278032112, 18.176936900799, 1.74711580562318, 1.45310012173878,
0.952143653091356, 1.16652405720194, 1.11866015943186, 2.68527336222893,
1.12853921993574, 5.10727247259446, 1.93994049536545, 1.36475795626174,
2.95717137358439, 0.115367165512589, 0, 1.45815337045876, 0,
1.78943634306828, 5.71749991297189, 2.43536004133198, 1.27231795918686,
11.4771984230702, 3.0971032186365, 0.708058471655881, 0.170261025718881,
3.37435307537382, 1.56044494248123, 1.09294450754124, 0, 2.25592933845801,
2.30276525800757, 1.86149986210819, 1.46145976307003, 1.26858067553346,
2.11041986636824, 0.0902116364175813, 1.54299863875175, 0, 0.269632340125967,
1.88548693593634, 4.47233507072462, 0.66752451890319)), class = "data.frame", row.names = c("age",
"Weight", "HookhConsumption", "BMI", "SystolicBP", "RR", "DiastolicBP",
"ALP", "ALT", "AST", "Albumin", "BS", "CPK", "CRP", "Calcium",
"Creatinine", "Ddimer", "Directbilirubin", "ESR", "FBS", "Ferritin",
"HB", "LDH", "Lymphocyte", "Mg", "Neutrophyl", "PLT", "PR", "PhosphorP",
"PotassiumK", "SodiumNA", "Totalbilirubin", "Urea", "WBC", "EjectionFraction",
"TotalLungInvolvementRank", "TotalLungInvolvementPercent", "sex2",
"Type.of.heart.disease1", "Type.of.heart.disease2", "Type.of.heart.disease9",
"Unilateral.paralysis1", "Ulcers1", "Obesity.BMI.above.351",
"Peripheral.artery.disease1", "organ.involment.from.diabetes1",
"organ.involment.from.diabetes2", "organ.involment.from.diabetes3",
"UsingDrugHistory1", "UsingAlcoholHistory1", "Transplantation1",
"SeverityofKidneyDisease1", "SeverityofKidneyDisease2", "SeverityofKidneyDisease3",
"SeverityChronicliverdisease1", "SeverityChronicliverdisease2",
"SeverityChronicliverdisease3", "SeverityChronicliverdisease4",
"SeverityChronicliverdisease9", "Schizophrenia1", "Rheumatologicaldiseases1",
"Pregnant1", "Neurologicaldiseases1", "LiverTransplantation1",
"KidneyTransplantation1", "Immunedeficiencydisease1", "Hypothyroidism1",
"Hypertention1", "Hyperlipidemia1", "Historyofsmoking1", "HistoryofHookah1",
"HeartTransplantation1", "HIV1", "FattyLiver1", "Diabetes1",
"Chronicliverdisease1", "Chronickidneydisease1", "CardiovascularDisease1",
"Cancers1", "CVAStrokeCVDTIA1", "COPD1", "Asthma1", "WetCough1",
"WeightLoss1", "WeaknessandLethargy1", "Vomit1", "Trembling1",
"Sweating1", "Sputum1", "Sorethroat1", "SkinRush1", "Rush1",
"Rhinorrhea1", "PharynxExoda1", "Nausea1", "Muscle_Painmyalgia1",
"Lossofsenseoftaste1", "Lossofsenseofsmell1", "LossofConsciousness1",
"LimbEdema1", "Jointpain_Arthralgia1", "Hemoptysis1", "Headace1",
"Fever1", "Fatigue1", "EyeConjunctivitis1", "Epigastric1", "Dyspnea1",
"DryCough1", "Dizziness1", "Diarrhea1", "Chestpain1", "CardiacArrhythmia1",
"Body_Pain1", "Bleeding1", "Ataxia1", "Anorexia1", "PCRCOVID19Test1",
"PCRCOVID19Test2")), model = "rf", calledFrom = "varImp"), class = "varImp.train"),
structure(list(importance = structure(list(Overall = c(100,
36.8463357663146, 0, 20.5921448468941, 35.0980630859042,
15.7098956910968, 27.5542325637653, 22.3935810225052, 25.6062709809081,
18.9072078537409, 30.5428709528983, 26.4061314161858, 27.2933977255992,
18.3744993875278, 57.5115149169245, 14.4361277134982, 49.9265957132235,
6.10831602661626, 28.2527379885906, 23.0147565449908, 32.7997892888894,
22.7055707536584, 36.9763807158356, 28.9941599048441, 17.8186386653819,
31.2682240107287, 26.2894098494535, 41.1751827476675, 22.6316241605114,
16.9314172346857, 14.4927913128733, 13.1792980470757, 44.2836496383372,
32.7246002717468, 30.3912750391576, 10.0409713536124, 9.83444013035946,
2.50470824612248, 1.72055335723373, 1.05083165735798, 1.56193393834476,
0.233521622728958, 1.08064736921506, 0.555709266569136, 2.40106539585553,
0.291833555475466, 0.380999891346632, 2.56592221397732, 1.62107348934456,
0.504647559430998, 1.19859835755469, 0, 1.4382135880929,
1.94514657535966, 0, 0.0569205442253742, 0.44589056596685,
0.0539230755197555, 0, 0.055077983652405, 1.24527213390211,
0, 1.36267778294481, 0.151259347248717, 0.499919817645286,
0, 2.79981213016671, 2.72663427247346, 1.93725253183476,
2.70715099933653, 1.99722906280419, 0, 0.111342938271961,
1.2426657762317, 2.15186257620788, 0.584084013981451, 9.87542370836023,
3.21493418783175, 14.6556614893423, 0.67462103889104, 0.787088521176588,
2.61946726039402, 2.8099384934716, 0.377053883833586, 2.2824838493133,
1.12217532020233, 3.44210364347885, 2.61343827037804, 9.58864870521531,
1.77823199575717, 0, 0, 0.828679129518211, 0, 2.73842874693014,
14.5506870851474, 0.390367251047195, 0.811902694072225, 15.5803912323052,
4.18258978600944, 2.13546475796113, 2.66088800284236, 2.97761832225233,
3.54039994200135, 2.44519084017892, 0.737528372419208, 2.20708600548186,
4.12502178170407, 3.1835668678093, 7.61195991815971, 2.35303302862437,
5.70342032074721, 0.409606955773683, 2.4977310780031, 0.0107020031498121,
0.268000372472171, 2.32396173268619, 1.64515893404575, 0.868523484401606
)), class = "data.frame", row.names = c("age", "Weight",
"HookhConsumption", "BMI", "SystolicBP", "RR", "DiastolicBP",
"ALP", "ALT", "AST", "Albumin", "BS", "CPK", "CRP", "Calcium",
"Creatinine", "Ddimer", "Directbilirubin", "ESR", "FBS",
"Ferritin", "HB", "LDH", "Lymphocyte", "Mg", "Neutrophyl",
"PLT", "PR", "PhosphorP", "PotassiumK", "SodiumNA", "Totalbilirubin",
"Urea", "WBC", "EjectionFraction", "TotalLungInvolvementRank",
"TotalLungInvolvementPercent", "sex2", "Type.of.heart.disease1",
"Type.of.heart.disease2", "Type.of.heart.disease9", "Unilateral.paralysis1",
"Ulcers1", "Obesity.BMI.above.351", "Peripheral.artery.disease1",
"organ.involment.from.diabetes1", "organ.involment.from.diabetes2",
"organ.involment.from.diabetes3", "UsingDrugHistory1", "UsingAlcoholHistory1",
"Transplantation1", "SeverityofKidneyDisease1", "SeverityofKidneyDisease2",
"SeverityofKidneyDisease3", "SeverityChronicliverdisease1",
"SeverityChronicliverdisease2", "SeverityChronicliverdisease3",
"SeverityChronicliverdisease4", "SeverityChronicliverdisease9",
"Schizophrenia1", "Rheumatologicaldiseases1", "Pregnant1",
"Neurologicaldiseases1", "LiverTransplantation1", "KidneyTransplantation1",
"Immunedeficiencydisease1", "Hypothyroidism1", "Hypertention1",
"Hyperlipidemia1", "Historyofsmoking1", "HistoryofHookah1",
"HeartTransplantation1", "HIV1", "FattyLiver1", "Diabetes1",
"Chronicliverdisease1", "Chronickidneydisease1", "CardiovascularDisease1",
"Cancers1", "CVAStrokeCVDTIA1", "COPD1", "Asthma1", "WetCough1",
"WeightLoss1", "WeaknessandLethargy1", "Vomit1", "Trembling1",
"Sweating1", "Sputum1", "Sorethroat1", "SkinRush1", "Rush1",
"Rhinorrhea1", "PharynxExoda1", "Nausea1", "Muscle_Painmyalgia1",
"Lossofsenseoftaste1", "Lossofsenseofsmell1", "LossofConsciousness1",
"LimbEdema1", "Jointpain_Arthralgia1", "Hemoptysis1", "Headace1",
"Fever1", "Fatigue1", "EyeConjunctivitis1", "Epigastric1",
"Dyspnea1", "DryCough1", "Dizziness1", "Diarrhea1", "Chestpain1",
"CardiacArrhythmia1", "Body_Pain1", "Bleeding1", "Ataxia1",
"Anorexia1", "PCRCOVID19Test1", "PCRCOVID19Test2")), model = "rf",
calledFrom = "varImp"), class = "varImp.train"), structure(list(
importance = structure(list(Overall = c(100, 36.4519408382731,
0.0121282468302786, 27.9982404793903, 19.4487163883379,
24.6079653972917, 14.1539998143239, 18.684018340339,
20.1182663550791, 17.4200861293186, 46.6309831468223,
52.2217679510578, 28.5910698857479, 16.845796014194,
31.6509235655573, 17.1000574614637, 27.8424176478161,
5.69845064904499, 21.3838903337718, 20.217605303817,
19.8702958841878, 22.3737582989512, 33.0788664305301,
20.6035947546629, 16.3220426343042, 23.4809287675538,
23.1749036748423, 57.122094059206, 12.2409421568247,
11.234114301956, 15.7946508155502, 8.80563729211453,
20.2205078755919, 20.3091908316546, 27.7497357152039,
3.8622908315769, 12.8894291926347, 5.96701805516155,
0.761922263853243, 1.41991036581607, 1.54560737492769,
0.825161722105208, 0.0172016746252156, 0.693982409239905,
0, 0.358366468201754, 1.74812586771487, 2.2746344067366,
0.745595100629448, 0.465199425668223, 0.408092232849501,
0.115358703965213, 0.0358338604150282, 2.88640197248697,
0, 0.288302498762889, 0.332551323637155, 0.0121282468302786,
0, 1.03515126482736, 1.1213600137207, 0.329413397366096,
2.0612368962315, 0, 0.610994615626186, 1.0215655608971,
3.90651448858199, 1.73374217783332, 1.47244358073369,
2.20534241559288, 0.173681720638885, 0, 0.631950099628902,
0.132328128708788, 2.92435478031454, 1.03537122788376,
4.74067414123091, 1.77981701502525, 13.1150432121738,
0.720556880972878, 1.20366662244445, 1.19169376389038,
1.86442992849398, 0.518200723424615, 2.278501378269,
1.23638371282217, 3.66947066761794, 2.03933409738165,
1.25289331603719, 1.01627904400807, 0.0324453169731015,
0, 2.29817177168672, 0, 1.53194610140319, 7.15322639329996,
0.759542631415349, 1.53353473284619, 4.77390474517756,
1.05656481042379, 0.699450154375729, 1.16224285818854,
3.65223350861514, 1.93274707207956, 1.57589588221639,
0.449432695377871, 1.36863730886437, 2.11275137384133,
3.29450357362525, 1.08676677214028, 2.18565092410049,
1.15456248328987, 0.492245547306216, 1.59592156033113,
0.0129367966189638, 0.514499765305734, 1.58591810753971,
1.84832826238423, 0.807564130566264)), class = "data.frame", row.names = c("age",
"Weight", "HookhConsumption", "BMI", "SystolicBP", "RR",
"DiastolicBP", "ALP", "ALT", "AST", "Albumin", "BS",
"CPK", "CRP", "Calcium", "Creatinine", "Ddimer", "Directbilirubin",
"ESR", "FBS", "Ferritin", "HB", "LDH", "Lymphocyte",
"Mg", "Neutrophyl", "PLT", "PR", "PhosphorP", "PotassiumK",
"SodiumNA", "Totalbilirubin", "Urea", "WBC", "EjectionFraction",
"TotalLungInvolvementRank", "TotalLungInvolvementPercent",
"sex2", "Type.of.heart.disease1", "Type.of.heart.disease2",
"Type.of.heart.disease9", "Unilateral.paralysis1", "Ulcers1",
"Obesity.BMI.above.351", "Peripheral.artery.disease1",
"organ.involment.from.diabetes1", "organ.involment.from.diabetes2",
"organ.involment.from.diabetes3", "UsingDrugHistory1",
"UsingAlcoholHistory1", "Transplantation1", "SeverityofKidneyDisease1",
"SeverityofKidneyDisease2", "SeverityofKidneyDisease3",
"SeverityChronicliverdisease1", "SeverityChronicliverdisease2",
"SeverityChronicliverdisease3", "SeverityChronicliverdisease4",
"SeverityChronicliverdisease9", "Schizophrenia1", "Rheumatologicaldiseases1",
"Pregnant1", "Neurologicaldiseases1", "LiverTransplantation1",
"KidneyTransplantation1", "Immunedeficiencydisease1",
"Hypothyroidism1", "Hypertention1", "Hyperlipidemia1",
"Historyofsmoking1", "HistoryofHookah1", "HeartTransplantation1",
"HIV1", "FattyLiver1", "Diabetes1", "Chronicliverdisease1",
"Chronickidneydisease1", "CardiovascularDisease1", "Cancers1",
"CVAStrokeCVDTIA1", "COPD1", "Asthma1", "WetCough1",
"WeightLoss1", "WeaknessandLethargy1", "Vomit1", "Trembling1",
"Sweating1", "Sputum1", "Sorethroat1", "SkinRush1", "Rush1",
"Rhinorrhea1", "PharynxExoda1", "Nausea1", "Muscle_Painmyalgia1",
"Lossofsenseoftaste1", "Lossofsenseofsmell1", "LossofConsciousness1",
"LimbEdema1", "Jointpain_Arthralgia1", "Hemoptysis1",
"Headace1", "Fever1", "Fatigue1", "EyeConjunctivitis1",
"Epigastric1", "Dyspnea1", "DryCough1", "Dizziness1",
"Diarrhea1", "Chestpain1", "CardiacArrhythmia1", "Body_Pain1",
"Bleeding1", "Ataxia1", "Anorexia1", "PCRCOVID19Test1",
"PCRCOVID19Test2")), model = "rf", calledFrom = "varImp"), class = "varImp.train"))
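Regarding this part:
how can I calculate the mean of each feature if it exists? for example, how can I calculate the mean of three overall "ESR"?
Since you have already generated the list, you can write a function that selects the row with the feature name, apply this function to each element of the list, flatten the result, and then compute the mean. If the feature is missing from some elements, use na.rm to exclude those elements from the mean calculation.
For example, this resembles your list: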
mylist <- list(structure(list(Overall = c(100, 97.36, 60.18, 42.41, 35.26,
32.14, 29.79, 27.66, 26.98, 25.68, 25.59, 24.42, 23.87, 22.36,
22.01, 21.23, 20.21, 19.32, 18.99, 18.78)), class = "data.frame", row.names = c("Albumin",
"age", "PR", "RR", "Weight", "SystolicBP", "Cancers1", "ESR",
"Neutrophyl", "CPK", "EjectionFraction", "BMI", "Calcium", "WBC",
"Urea", "LDH", "FBS", "Ddimer", "HB", "Lymphocyte")), structure(list(
Overall = c(100, 57.8, 53.88, 53.84, 53.52, 52.31, 51.69,
50.15, 49.31, 47.05, 46.87, 46.54, 45.64, 44.87, 43.54, 43.03,
43, 42.83, 42.53, 41.43)), class = "data.frame", row.names = c("age",
"FBS", "WBC", "PR", "Neutrophyl", "Weight", "HB", "LDH", "Urea",
"Albumin", "Lymphocyte", "CPK", "SystolicBP", "Calcium", "ESR",
"Ferritin", "CRP", "PLT", "Creatinine", "EjectionFraction")),
structure(list(Overall = c(100, 43.41, 24.88, 24.63, 23.31,
21.47, 21.06, 20.68, 17.94, 17.29, 16.49, 16.11, 15.72, 15.28,
14.94, 14.68, 14.5, 14.38, 13.05, 12.96)), class = "data.frame", row.names = c("age",
"Albumin", "Weight", "FBS", "BS", "PR", "LDH", "Neutrophyl",
"BMI", "EjectionFraction", "CPK", "WBC", "ALP", "RR", "Lymphocyte",
"Cancers1", "CRP", "ESR", "Ddimer", "Ferritin")))
Here is how to compute the mean of ESR, which is present in all elements, and of CRP, which is missing from one of the elements:
mylist |> lapply(function(dat) dat["ESR", "Overall"]) |> unlist() |> mean(na.rm = TRUE)
#[1] 28.52667
mylist |> lapply(function(dat) dat["CRP", "Overall"]) |> unlist() |> mean(na.rm = TRUE)
#[1] 28.75
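Why na.rm = TRUE matters can be seen in a minimal sketch. The small data frame toy below is a hypothetical stand-in for one list element and is not part of the original data. Indexing a data frame by a row name that is not there returns NA instead of raising an error, so a feature that did not make it into a given run shows up as NA and has to be dropped before averaging.

```r
# Hypothetical toy data frame standing in for one element of mylist
toy <- data.frame(Overall = c(100, 27.66), row.names = c("age", "ESR"))

present <- toy["ESR", "Overall"]   # row exists: returns its value
absent  <- toy["CRP", "Overall"]   # row does not exist: returns NA, no error

# Without na.rm the NA propagates; with na.rm it is dropped first
mean_with_na    <- mean(c(present, absent))
mean_without_na <- mean(c(present, absent), na.rm = TRUE)
```

Here mean_with_na is NA, while mean_without_na is computed from the single available value.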
Since you have many features, you can write another function that applies this step to each feature. For example:
features <- c("ESR", "CRP", "CPK", "WBC", "LDH")
feature_mean <- function(feature_name){
out <- lapply(mylist, function(dat) dat[feature_name, "Overall"])|>
unlist() |> mean(na.rm = TRUE) |>
setNames(paste0("mean_",feature_name))
return(out)
}
features |> lapply(feature_mean) |> unlist()
#mean_ESR mean_CRP mean_CPK mean_WBC mean_LDH
#28.52667 28.75000 29.57000 30.78333 30.81333
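The question also asks whether the features and their Overall values can be collected in a data frame. One possible sketch, assuming each list element is a one-column data frame with feature names as row names (as in mylist), is shown below. The list runs and the names all_features, imp_mat and mean_importance are hypothetical and only serve the illustration.

```r
# Hypothetical miniature version of mylist (not the original data)
runs <- list(
  data.frame(Overall = c(100, 60, 20), row.names = c("age", "ESR", "CRP")),
  data.frame(Overall = c(100, 40),     row.names = c("age", "ESR")),
  data.frame(Overall = c(90, 50, 30),  row.names = c("age", "ESR", "WBC"))
)

# Every feature that appears in at least one run
all_features <- unique(unlist(lapply(runs, rownames)))

# One column per run; match() is exact, so absent features become NA
imp_mat <- sapply(runs, function(dat) {
  dat$Overall[match(all_features, rownames(dat))]
})
rownames(imp_mat) <- all_features

# Per-feature mean and the number of runs in which the feature appeared
mean_importance <- data.frame(
  feature      = all_features,
  mean_overall = unname(rowMeans(imp_mat, na.rm = TRUE)),
  n_runs       = unname(rowSums(!is.na(imp_mat)))
)
mean_importance
```

match() is used on purpose: according to the R documentation for data frame extraction, row names are partially matched by [, which could pick up a longer feature name when a short one is missing. The n_runs column makes it visible when a mean rests on only a few runs.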
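Edit
The synthetic data mylist used in the previous example contains only a single "Overall" data frame in each of its elements, so the feature extraction could be applied directly with lapply. However, the actual data importance_rf that you provided in the updated question has several objects in each element, and the "Overall" data frame is the first of them. That difference is the cause of the error you showed in the comments. To apply the extraction, first pull out the "Overall" data frame with lapply(function(list) list[[1]]), then apply the previous steps.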
# Extract mean ESR
importance_rf |>
lapply(function(list) list[[1]]) |>
lapply(function(dat) dat["ESR", "Overall"]) |>
unlist() |>
mean(na.rm = TRUE)
#[1] 23.98857
# Extract mean CRP
importance_rf |>
lapply(function(list) list[[1]]) |>
lapply(function(dat) dat["CRP", "Overall"]) |>
unlist() |>
mean(na.rm = TRUE)
#[1] 17.4323
A {base R} way
The previous steps can be applied to a vector of features as follows:
features <- c("ESR", "CRP", "CPK", "WBC", "LDH")
feature_mean <- function(feature_name){
out <- importance_rf |>
lapply(function(list) list[[1]]) |>
lapply(function(dat) dat[feature_name, "Overall"])|>
unlist() |> mean(na.rm = TRUE) |>
setNames(paste0("mean_",feature_name))
return(out)
}
# Extract the mean values
features |> lapply(feature_mean) |> unlist()
#mean_ESR mean_CRP mean_CPK mean_WBC mean_LDH
#23.98857 17.43230 26.19575 26.52498 30.44491
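A brief explanation of the code:
lapply(function(list) list[[1]]) extracts the first element of each element of the importance_rf list, that is, the data frame that holds the feature data.
dat[feature_name, "Overall"] extracts the value of the target feature feature_name from each extracted data frame. Each pass extracts only one feature from each data frame.
unlist() converts the extracted values from a list to a numeric vector.
setNames names the resulting value, so it is easy to see which feature the mean belongs to.
All functions used this way belong to base R (the native pipe |> requires R 4.1.0 or later). You do not need to install any external package to use them.
Another option is to combine base R functions with functions from the purrr package.
A {purrr} way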
library(purrr)
importance_rf |>
map(pluck(1,1)) |>
map(function(dat) set_names(dat[features,], features)) |>
as.data.frame() |>
rowMeans() |>
set_names(paste0("mean_", features))
#mean_ESR mean_CRP mean_CPK mean_WBC mean_LDH
#23.98857 17.43230 26.19575 26.52498 30.44491
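These steps are much shorter than the base R steps above, but what each step does may be less obvious.
Note that map is similar to lapply, and pluck(x,1,1) is equivalent to x[[1]][[1]]. In the pipeline above, pluck(1,1) is called without a list argument, so it evaluates to 1, and map(1) then extracts the first element of each list element.
A brief explanation of the code:
map(pluck(1,1)) extracts the data frame, similar to lapply(function(list) list[[1]]) above.
map(function(dat) set_names(dat[features,], features)) extracts the selected features, similar to dat[feature_name, "Overall"] above.
There is a difference:
In the base R way above, one feature is extracted from all data frames and its mean is computed, and then the next feature is handled the same way.
In this purrr way, all target features are extracted from each data frame in the list, and as.data.frame then combines them into a new data frame in which each row represents one feature. rowMeans is then used to compute the mean of all values of each feature.
Note that you can inspect the result of each step by running the pipeline only up to the corresponding |>. For example, importance_rf on its own shows all objects in each element, while importance_rf |> map(pluck(1,1)) shows only the data frame objects.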
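When the goal is the mean of every feature rather than a hand-picked few, one possible approach is to bind all runs into a single matrix and summarise its rows. The sketch below assumes each element has the structure shown in the dput output (a list whose importance element is a one-column data frame). The helper make_imp, the list toy_importance and its values are hypothetical stand-ins, not the original data.

```r
# Hypothetical stand-ins for varImp() results, mimicking the dput structure
make_imp <- function(values) {
  structure(
    list(importance = data.frame(Overall = values,
                                 row.names = c("age", "ESR", "CRP")),
         model = "rf", calledFrom = "varImp"),
    class = "varImp.train"
  )
}
toy_importance <- list(make_imp(c(100, 20, 10)),
                       make_imp(c(100, 30, 20)),
                       make_imp(c(100, 40, 0)))

# Use the row names of the first run as the reference order
feature_names <- rownames(toy_importance[[1]]$importance)

# One column per run, rows aligned by exact name
imp_mat <- vapply(
  toy_importance,
  function(x) x$importance$Overall[match(feature_names, rownames(x$importance))],
  numeric(length(feature_names))
)

# Mean and standard deviation of each feature across runs, sorted by mean
summary_df <- data.frame(
  feature      = feature_names,
  mean_overall = unname(rowMeans(imp_mat, na.rm = TRUE)),
  sd_overall   = unname(apply(imp_mat, 1, sd, na.rm = TRUE))
)
summary_df <- summary_df[order(summary_df$mean_overall, decreasing = TRUE), ]
summary_df
```

With a thousand runs this keeps the whole computation in one matrix, and the standard deviation column gives a rough idea of how stable each feature's importance is across resamples.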
Update: including weighted means
Here is a simple example of how to compute the weighted mean of each feature in a list. Suppose you have this list:
some.list <- list(L1 = c(a = 2, b = 4, c = 7),
L2 = c(a = 5, b = 5, c = 2),
L3 = c(a = 3, b = 3, c = 6))
some.list
$L1
a b c
2 4 7
$L2
a b c
5 5 2
$L3
a b c
3 3 6
Suppose L1, L2 and L3 in the list have the following weights:
weight <- c(w.L1 = 0.5, w.L2=0.6, w.L3 = 0.9)
weight
w.L1 w.L2 w.L3
0.5 0.6 0.9
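To compute the weighted mean of a, for example, the calculation is:
weighted mean of a = (w1*a1 + w2*a2 + w3*a3) / (w1 + w2 + w3) = (0.5*2 + 0.6*5 + 0.9*3) / (0.5 + 0.6 + 0.9) = 3.35
You get this value by multiplying each value of a in the list by the corresponding normalized weight and summing the products. In this case, the normalized weight of w1 is w1/(w1+w2+w3).
To perform these steps in R: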
norm.weight <- weight/sum(weight)
norm.weight
w.L1 w.L2 w.L3
0.25 0.30 0.45
# weighted means of a,b, and c
some.list |> map2(norm.weight, `*`) |> as.data.frame() |> rowSums()
a b c
3.35 3.85 5.05
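As a cross-check, the same numbers can be obtained without purrr by using weighted.mean() from the stats package, which divides by the sum of the weights itself. This is only a sketch of an alternative. The names value_mat and weighted are introduced here for illustration.

```r
some.list <- list(L1 = c(a = 2, b = 4, c = 7),
                  L2 = c(a = 5, b = 5, c = 2),
                  L3 = c(a = 3, b = 3, c = 6))
weight <- c(w.L1 = 0.5, w.L2 = 0.6, w.L3 = 0.9)

# Rows are the features a, b, c; columns are L1, L2, L3
value_mat <- do.call(cbind, some.list)

# weighted.mean() normalises the weights internally, so raw weights are fine
weighted <- apply(value_mat, 1, weighted.mean, w = weight)
weighted
```

If some values are NA, passing na.rm = TRUE to weighted.mean() drops those values together with their weights before the weights are normalised.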
Applying these mock weight values to your importance_rf list and the features from the example, we get:
importance_rf |>
map(pluck(1,1)) |>
map(function(dat) set_names(dat[features,], features)) |>
map2(norm.weight, `*`) |>
as.data.frame() |>
rowSums()
ESR CRP CPK WBC LDH
23.68084 17.36211 26.72970 25.59180 31.29827
"Rhinorrhea1", "PharynxExoda1", "Nausea1", "Muscle_Painmyalgia1",
"Lossofsenseoftaste1", "Lossofsenseofsmell1", "LossofConsciousness1",
"LimbEdema1", "Jointpain_Arthralgia1", "Hemoptysis1",
"Headace1", "Fever1", "Fatigue1", "EyeConjunctivitis1",
"Epigastric1", "Dyspnea1", "DryCough1", "Dizziness1",
"Diarrhea1", "Chestpain1", "CardiacArrhythmia1", "Body_Pain1",
"Bleeding1", "Ataxia1", "Anorexia1", "PCRCOVID19Test1",
"PCRCOVID19Test2")), model = "rf", calledFrom = "varImp"), class = "varImp.train"))
Regarding this part:
how can I calculate the mean of each feature if it exists? for example, how can I calculate the mean of three overall "ESR"?
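Since you have already generated the list, you can write a function that selects the row for a given feature name, apply it to every element of the list, flatten the result, and then take the mean. If the feature is missing from some elements, pass na.rm = TRUE to mean() so that the resulting NA values are ignored.
For example, here is a list similar to yours: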
mylist <- list(structure(list(Overall = c(100, 97.36, 60.18, 42.41, 35.26,
32.14, 29.79, 27.66, 26.98, 25.68, 25.59, 24.42, 23.87, 22.36,
22.01, 21.23, 20.21, 19.32, 18.99, 18.78)), class = "data.frame", row.names = c("Albumin",
"age", "PR", "RR", "Weight", "SystolicBP", "Cancers1", "ESR",
"Neutrophyl", "CPK", "EjectionFraction", "BMI", "Calcium", "WBC",
"Urea", "LDH", "FBS", "Ddimer", "HB", "Lymphocyte")), structure(list(
Overall = c(100, 57.8, 53.88, 53.84, 53.52, 52.31, 51.69,
50.15, 49.31, 47.05, 46.87, 46.54, 45.64, 44.87, 43.54, 43.03,
43, 42.83, 42.53, 41.43)), class = "data.frame", row.names = c("age",
"FBS", "WBC", "PR", "Neutrophyl", "Weight", "HB", "LDH", "Urea",
"Albumin", "Lymphocyte", "CPK", "SystolicBP", "Calcium", "ESR",
"Ferritin", "CRP", "PLT", "Creatinine", "EjectionFraction")),
structure(list(Overall = c(100, 43.41, 24.88, 24.63, 23.31,
21.47, 21.06, 20.68, 17.94, 17.29, 16.49, 16.11, 15.72, 15.28,
14.94, 14.68, 14.5, 14.38, 13.05, 12.96)), class = "data.frame", row.names = c("age",
"Albumin", "Weight", "FBS", "BS", "PR", "LDH", "Neutrophyl",
"BMI", "EjectionFraction", "CPK", "WBC", "ALP", "RR", "Lymphocyte",
"Cancers1", "CRP", "ESR", "Ddimer", "Ferritin")))
Here is how to compute the mean of ESR, which is present in every element, and of CRP, which is missing from one of them:
mylist |> lapply(function(dat) dat["ESR", "Overall"]) |> unlist() |> mean(na.rm = TRUE)
#[1] 28.52667
mylist |> lapply(function(dat) dat["CRP", "Overall"]) |> unlist() |> mean(na.rm = TRUE)
#[1] 28.75
Because you have many features, you can wrap this step in another function and apply it to each feature. For example:
features <- c("ESR", "CRP", "CPK", "WBC", "LDH")
feature_mean <- function(feature_name){
  # Pull one feature from every data frame, then average the values that exist
  out <- lapply(mylist, function(dat) dat[feature_name, "Overall"]) |>
    unlist() |> mean(na.rm = TRUE) |>
    setNames(paste0("mean_", feature_name))
  return(out)
}
features |> lapply(feature_mean) |> unlist()
#mean_ESR mean_CRP mean_CPK mean_WBC mean_LDH
#28.52667 28.75000 29.57000 30.78333 30.81333
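The question also asks whether the features and their Overall values can be collected into a data frame. One possible sketch in base R is to stack every run into a long data frame and aggregate by feature. The names toy_list, long, and mean_df are made up for illustration, and toy_list is only a trimmed-down stand-in for mylist.

```r
# Trimmed-down stand-in for mylist (a few rows per run, for illustration only)
toy_list <- list(
  data.frame(Overall = c(100, 27.66, 25.68), row.names = c("Albumin", "ESR", "CPK")),
  data.frame(Overall = c(100, 43.54, 43.00), row.names = c("age", "ESR", "CRP")),
  data.frame(Overall = c(100, 14.50, 14.38), row.names = c("age", "CRP", "ESR"))
)

# Stack all runs into one long data frame: one row per (run, feature) pair
long <- do.call(rbind, lapply(seq_along(toy_list), function(i) {
  data.frame(run = i,
             feature = rownames(toy_list[[i]]),
             Overall = toy_list[[i]]$Overall)
}))

# Mean importance per feature, using only the runs in which the feature appears
mean_df <- aggregate(Overall ~ feature, data = long, FUN = mean)
mean_df
```

Because a feature that is absent from a run simply has no row in long, no na.rm handling is needed here. The long data frame also keeps the per-run values, which is handy if you later want a standard deviation or a count per feature.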
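Edit
The synthetic data mylist used in the previous example holds just one data frame (the one with the Overall column) in each element, so the feature extraction can be applied to it directly with lapply. The real data importance_rf that you provided in the updated question, however, has several objects in each element, and the Overall data frame is the first of them. That difference is what causes the error you showed in the comments. To extract the values, first pull out the Overall data frame with lapply(function(list) list[[1]]) and then apply the previous steps.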
# Extract mean ESR
importance_rf |>
lapply(function(list) list[[1]]) |>
lapply(function(dat) dat["ESR", "Overall"]) |>
unlist() |>
mean(na.rm = TRUE)
#[1] 23.98857
# Extract mean CRP
importance_rf |>
lapply(function(list) list[[1]]) |>
lapply(function(dat) dat["CRP", "Overall"]) |>
unlist() |>
mean(na.rm = TRUE)
#[1] 17.4323
A {base R} way
The previous steps can be applied to a vector of features as follows:
features <- c("ESR", "CRP", "CPK", "WBC", "LDH")
feature_mean <- function(feature_name){
  out <- importance_rf |>
    lapply(function(list) list[[1]]) |>                   # keep only the importance data frame
    lapply(function(dat) dat[feature_name, "Overall"]) |> # one value (or NA) per run
    unlist() |> mean(na.rm = TRUE) |>
    setNames(paste0("mean_", feature_name))
  return(out)
}
# Extract the mean values
features |> lapply(feature_mean) |> unlist()
#mean_ESR mean_CRP mean_CPK mean_WBC mean_LDH
#23.98857 17.43230 26.19575 26.52498 30.44491
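A brief explanation of the code:
lapply(function(list) list[[1]]) extracts the first element of each element of the importance_rf list, which is the data frame holding the feature data.
dat[feature_name, "Overall"] extracts the value of the target feature feature_name from each extracted data frame. Each call extracts only one feature from each data frame.
unlist() converts the extracted values from a list into a numeric vector.
setNames names the resulting value so that it is easy to see which feature the mean belongs to.
All functions used here are part of base R, so you do not need to install any external package.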
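If you would rather not type the feature names by hand, the same idea can be extended to every feature that appears in any run. The following is only a sketch: toy_importance is a hypothetical miniature of importance_rf (each element is a list whose importance element is the data frame), and the numbers in it are invented.

```r
# Hypothetical miniature of importance_rf; values are invented for illustration
toy_importance <- list(
  list(importance = data.frame(Overall = c(100, 20, 10),
                               row.names = c("age", "ESR", "CRP")),
       model = "rf", calledFrom = "varImp"),
  list(importance = data.frame(Overall = c(100, 40, 36),
                               row.names = c("age", "ESR", "ALP")),
       model = "rf", calledFrom = "varImp"),
  list(importance = data.frame(Overall = c(90, 20, 30),
                               row.names = c("age", "CRP", "ESR")),
       model = "rf", calledFrom = "varImp")
)

# Union of all feature names seen in any run
all_features <- unique(unlist(lapply(toy_importance,
                                     function(x) rownames(x$importance))))

# For each feature, collect one value per run (NA where absent) and average
mean_importance <- vapply(all_features, function(f) {
  vals <- vapply(toy_importance,
                 function(x) x$importance[f, "Overall"], numeric(1))
  mean(vals, na.rm = TRUE)
}, numeric(1))

# Collect the result in a data frame, most important feature first
result <- data.frame(feature = names(mean_importance),
                     mean_overall = unname(mean_importance))
result <- result[order(-result$mean_overall), ]
result
```

Note that the mean for a feature such as ALP is based on a single run here, so with real data it may be worth reporting the number of runs behind each mean as well.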
Another option is to combine base R functions with functions from the purrr package.
A {purrr} way
library(purrr)
importance_rf |>
map(pluck(1,1)) |>
map(function(dat) set_names(dat[features,], features)) |>
as.data.frame() |>
rowMeans() |>
set_names(paste0("mean_", features))
#mean_ESR mean_CRP mean_CPK mean_WBC mean_LDH
#23.98857 17.43230 26.19575 26.52498 30.44491
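These steps are much shorter than the base R version above, but what each step does may be less obvious.
Note that map is similar to lapply, and pluck(x, 1, 1) is equivalent to x[[1]][[1]]. In the call above, pluck(1, 1) is evaluated before map runs and simply returns 1, so map(pluck(1,1)) behaves like map(1), which extracts the first element of each list element by position.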
A brief explanation of the code:
map(pluck(1,1)) extracts the data frames, like lapply(function(list) list[[1]]) above.
map(function(dat) set_names(dat[features,], features)) extracts the listed features, like dat[feature_name, "Overall"] above.
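There is one difference:
In the base R way above, one feature is extracted from all the data frames and its mean is computed, and then the next feature is handled in the same way.
In the purrr way, all target features are extracted from each data frame in the list, and as.data.frame then combines them into a new data frame in which each row represents a feature. rowMeans is then used to compute the mean of all values of each feature.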
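The same "combine first, then average by row" idea can also be sketched in base R, with na.rm = TRUE added so that a feature missing from one run does not turn its mean into NA. The name toy_frames is made up for illustration; its values are a small subset of mylist.

```r
# Small subset of mylist; CRP is absent from the first run
toy_frames <- list(
  data.frame(Overall = c(27.66, 25.68), row.names = c("ESR", "CPK")),
  data.frame(Overall = c(43.54, 43.00, 46.54), row.names = c("ESR", "CRP", "CPK")),
  data.frame(Overall = c(14.38, 14.50, 16.49), row.names = c("ESR", "CRP", "CPK"))
)
features <- c("ESR", "CRP", "CPK")

# One column per run, one row per feature; a missing feature becomes NA
imp_matrix <- sapply(toy_frames, function(dat) dat[features, "Overall"])
rownames(imp_matrix) <- features

# Row-wise means that skip the NA entries
feature_means <- rowMeans(imp_matrix, na.rm = TRUE)
feature_means
```

This sketch assumes features has at least two entries; with a single feature, sapply returns a plain vector rather than a matrix and rowMeans would fail.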
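Note that you can inspect the result of each step by running the code only up to the corresponding |> pipe. For example, importance_rf on its own shows all the objects in each element, while
importance_rf |> map(pluck(1,1))
shows only the data frame objects.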
Update: weighted means
Here is a simple example of how to compute the weighted mean of each feature in a list. Suppose you have this list:
some.list <- list(L1 = c(a = 2, b = 4, c = 7),
L2 = c(a = 5, b = 5, c = 2),
L3 = c(a = 3, b = 3, c = 6))
some.list
$L1
a b c
2 4 7
$L2
a b c
5 5 2
$L3
a b c
3 3 6
Suppose L1, L2, and L3 in the list have the following weights:
weight <- c(w.L1 = 0.5, w.L2=0.6, w.L3 = 0.9)
weight
w.L1 w.L2 w.L3
0.5 0.6 0.9
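To compute the weighted mean of a, for example, the calculation is:
(w1*a1 + w2*a2 + w3*a3) / (w1 + w2 + w3)
You can get this by multiplying each value of a in the list by the corresponding normalized weight and summing the results. Here the normalized weight for w1 is w1/(w1+w2+w3).
To carry out these steps in R: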
norm.weight <- weight/sum(weight)
norm.weight
w.L1 w.L2 w.L3
0.25 0.30 0.45
# weighted means of a, b, and c (map2() comes from purrr, loaded above)
some.list |> map2(norm.weight, `*`) |> as.data.frame() |> rowSums()
a b c
3.35 3.85 5.05
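As a cross-check that needs no extra package, the same numbers can be obtained with base R's weighted.mean(), which divides by the sum of the weights itself, so the raw weights can be passed without normalizing them first. The name value_matrix is made up for this sketch.

```r
some.list <- list(L1 = c(a = 2, b = 4, c = 7),
                  L2 = c(a = 5, b = 5, c = 2),
                  L3 = c(a = 3, b = 3, c = 6))
weight <- c(w.L1 = 0.5, w.L2 = 0.6, w.L3 = 0.9)

# Rows are a, b, c; columns are L1, L2, L3
value_matrix <- do.call(cbind, some.list)

# weighted.mean() computes sum(x * w) / sum(w) for each row
weighted_means <- apply(value_matrix, 1, weighted.mean, w = weight)
weighted_means
```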
Applying these made-up weight values to your importance_rf list and the features from the example gives:
importance_rf |>
map(pluck(1,1)) |>
map(function(dat) set_names(dat[features,], features)) |>
map2(norm.weight, `*`) |>
as.data.frame() |>
rowSums()
ESR CRP CPK WBC LDH
23.68084 17.36211 26.72970 25.59180 31.29827
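One caveat, in case some feature is missing from a run: multiplying by pre-normalized weights and summing would then either give NA or, with NA values dropped, a value that is too small, because the remaining weights no longer add up to 1. A possible sketch of a workaround uses weighted.mean(..., na.rm = TRUE), which drops the missing values together with their weights and renormalizes over the rest. Here toy_runs and run_weight are invented for illustration.

```r
# Invented example: CRP is missing from the second run
toy_runs <- list(
  data.frame(Overall = c(20, 10), row.names = c("ESR", "CRP")),
  data.frame(Overall = 40,        row.names = "ESR"),
  data.frame(Overall = c(30, 20), row.names = c("ESR", "CRP"))
)
run_weight <- c(0.5, 0.6, 0.9)   # one hypothetical weight per run
features <- c("ESR", "CRP")

# One row per feature, one column per run; a missing feature becomes NA
imp_matrix <- sapply(toy_runs, function(dat) dat[features, "Overall"])
rownames(imp_matrix) <- features

# NA entries are dropped along with their weights before averaging
weighted_means <- apply(imp_matrix, 1, weighted.mean,
                        w = run_weight, na.rm = TRUE)
weighted_means
```

For CRP the result is based only on runs 1 and 3, that is (0.5*10 + 0.9*20) / (0.5 + 0.9).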