Data cleaning & subsetting in a nested list
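I couldn't find any previous question that covers these steps for a nested list, and my own attempts haven't gotten me anywhere!
I have a nested list df.
- I want to change the names of the first 3 columns in all data.frames to c("one","two","three").
- In each data frame, I want to keep the first 3 columns plus the column that has the same name as the data frame's name in the list.
- Now every data frame has 4 columns. I want to keep the values of the second column in each data frame where the value in the fourth column is greater than 3.
- Return a nested list containing the name of each data frame and the values selected from the second column (in the previous step).
purrr and dplyr approaches are preferred, but anything else is very much appreciated!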
> dput(map_depth(df,1, head))
list(`CD8_C01-LEF1` = structure(list(...1 = c("1236", "6194",
"51176", "6402", "6137", "1937"), ...2 = c("CCR7", "RPS6", "LEF1",
"SELL", "RPL13", "EEF1G"), ...3 = c(448.275813024615, 114.565282822255,
405.993571415472, 352.462886197845, 152.430598462657, 73.5226212775651
), `P-value*` = c(0, 2.35914832807463e-150, 0, 0, 1.03146807397557e-195,
3.00681346250943e-98), `CD8_C01-LEF1` = c(6.3388353508401, 1.36075129906401,
5.11667843995657, 5.22902495053118, 1.35703181746742, 1.72815687302818
), `CD8_C02-GPR183` = c(2.71993044636725, 0.755445092850178,
2.26029822474036, 3.57732840656951, 0.757664532314421, 0.732003573596204
), `CD8_C03-CX3CR1` = c(-2.50016459757821, 0.0430813598361915,
-1.47763877045973, -1.31104077043168, -0.118054173396857, -0.217984797372657
), `CD8_C04-GZMK` = c(-0.639352384551204, -0.304854019068466,
-1.400271288872, -1.56965980479594, -0.128422617265835, -0.701864111617954
), `CD8_C05-CD6` = c(-2.35873754058284, -0.115888861319928, -2.08628173736428,
-3.32630706764402, -0.177640817498698, -0.215754243123614), `CD8_C06-CD160` = c(-2.85558322130952,
-0.29530343951866, -2.20232116143474, -3.274807762691, -0.440783845861116,
-0.56207661416919), `CD8_C07-LAYN` = c(-2.75671138163062, -0.887003245107014,
-2.40845402752497, -3.47698326675668, -1.03656381624963, -1.46468960616135
), `CD8_C08-SLC4A10` = c(-2.68199272253543, 0.0292368512820967,
-2.1581654239029, -2.99895134853712, 0.0615744908900675, 0.192173783941343
)), row.names = c(NA, 6L), class = "data.frame"), `CD8_C02-GPR183` = structure(list(
...1 = c("3575", "4050", "1901", "6653", "1880", "10628"),
...2 = c("IL7R", "LTB", "S1PR1", "SORL1", "GPR183", "TXNIP"
), ...3 = c(268.347035159053, 151.397715576146, 423.815475272167,
154.131971403975, 161.502687932662, 138.188069200824), `P-value*` = c(0,
1.63481853000449e-194, 0, 1.09616441981898e-197, 3.47999420200636e-206,
5.87606326954945e-179), `CD8_C01-LEF1` = c(2.25872137515665,
1.06433926285014, 2.06890434595653, 1.77222927526522, -2.32256398023726,
1.17445992511194), `CD8_C02-GPR183` = c(3.58534594694992,
2.33774626980998, 3.1044712936119, 3.00075778716827, 1.54874669286004,
2.11053414857411), `CD8_C03-CX3CR1` = c(-2.73122665345433,
-3.23251051546321, 2.76359001828421, 0.899851788567591, -3.4595583469893,
1.9924219816788), `CD8_C04-GZMK` = c(-1.20359289904198, -2.27859013855459,
-0.289843306560729, 0.0930099548084882, 0.293766916539111,
-1.05998934689132), `CD8_C05-CD6` = c(0.771026257612103,
-1.84446654315228, -1.92859019625536, -0.993527571866541,
-0.517242518264243, -1.05505195656161), `CD8_C06-CD160` = c(-1.26433565787961,
-3.62072638085859, -1.99838091859197, -2.66224984657089,
-3.84677781455005, -0.741084525734145), `CD8_C07-LAYN` = c(-4.85420539962432,
-3.79535857695107, -2.07599716553024, -2.41001692585172,
-3.66993376805675, -1.90910214659534), `CD8_C08-SLC4A10` = c(1.79563839118781,
0.431971358693421, 0.24665792844753, 0.820564247625701, -0.941462395796914,
0.224912511574641)), row.names = c(NA, 6L), class = "data.frame"),
`CD8_C03-CX3CR1` = structure(list(...1 = c("5341", "1524",
"83888", "2214", "343413", "10219"), ...2 = c("PLEK", "CX3CR1",
"FGFBP2", "FCGR3A", "FCRL6", "KLRG1"), ...3 = c(372.816216710618,
713.554708746553, 575.834099328186, 419.996034284325, 215.715234731706,
281.827177706662), `P-value*` = c("0", "0", "0", "0", "3.5450627744914998E-266",
"0"), `CD8_C01-LEF1` = c(-1.34745098111019, -0.39476162886016,
-0.248194028712413, -0.326944139043036, -0.833877751680806,
-0.822668603983214), `CD8_C02-GPR183` = c(0.50737446056126,
-0.495638146054913, -0.484905896571723, -0.125753818325312,
0.0263098770399738, 0.894340812937189), `CD8_C03-CX3CR1` = c(6.36825282208761,
5.38301238794739, 5.26196506464758, 5.6197563760267, 5.8532850807879,
5.36851683724817), `CD8_C04-GZMK` = c(1.44463895049283, -0.513803138075432,
-0.125340966094923, 0.2447981258131, 1.34537977512099, 2.10784813093189
), `CD8_C05-CD6` = c(-0.718776566594413, -0.795121492384525,
-0.681892196238474, -0.421395883952147, 0.0987360993173341,
-1.35585804120358), `CD8_C06-CD160` = c(-0.550964233191398,
-0.794078725052049, -0.707741972359531, -0.156207202527366,
2.24842830259497, -1.28977809817504), `CD8_C07-LAYN` = c(0.0641870785667258,
-0.785201010640904, -0.631939964779986, -0.340799120353511,
0.271892089522186, 0.236064375692484), `CD8_C08-SLC4A10` = c(1.40102283829925,
-0.158585496249154, -0.056110756095033, 0.00915832466806331,
-0.085141865592199, 3.78847417230501)), row.names = c(NA,
6L), class = "data.frame"))
One solution is:
res <- lapply(setNames(nm = names(df)), function(dfname) {
  dff <- df[[dfname]]
  # only renaming column 2 as columns 1 and 3 are not used later on
  colnames(dff)[2] <- "two"
  # not 'keeping' the column with the same name as the dataframe, just using the dataframe straightaway
  dff$two[dff[,dfname] > 3]
})
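Note the setNames(...) call as the first argument to lapply. setNames(nm = names(df)) returns the names as a vector that is named by its own values, and when lapply receives a named vector or list it uses those names as the names of the elements it returns.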
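To make the naming behaviour concrete, here is a minimal base R sketch. The list toy and its contents are made up for illustration and are not part of the question's data.

```r
# Toy list standing in for the question's df (names and values are hypothetical)
toy <- list(a = 1:3, b = 4:6)

# Iterating over the bare names: the input vector is unnamed, so the result is too
unnamed <- lapply(names(toy), function(nm) sum(toy[[nm]]))
names(unnamed)

# setNames(nm = x) returns x named by its own values, so lapply keeps those names
keys <- setNames(nm = names(toy))
named <- lapply(keys, function(nm) sum(toy[[nm]]))
names(named)
```

The first result has no names, while the second is a list whose elements can be accessed as named$a and named$b.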
Here is a purrr and dplyr solution:
library(tidyverse)
# df_list is the list of data frames from the question (called df there)
map2(df_list, names(df_list),
     \(dat, name) {
       dat |>
         select(one = ...1,
                two = ...2,
                three = ...3,
                all_of(name)) |>
         (\(d) filter(d, d[,4] > 3))() |>
         pull(two)
     }
)
#> $`CD8_C01-LEF1`
#> [1] "CCR7" "LEF1" "SELL"
#>
#> $`CD8_C02-GPR183`
#> [1] "IL7R" "S1PR1" "SORL1"
#>
#> $`CD8_C03-CX3CR1`
#> [1] "PLEK" "CX3CR1" "FGFBP2" "FCGR3A" "FCRL6" "KLRG1"
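Edit: explanation
map2 = I use it here because you have a list of data frames, and map works well with lists. I use the "2" variant because you also want to select a column based on the list names.
\(dat, name) = creates an anonymous function that takes the two inputs from map2; I call the data dat and the list name name.
select(one = ...1, two = ...2, three = ...3, all_of(name)) = here I select and rename the first three columns as requested in your question, and I also select the column named after the list element with all_of(name). Remember that name is the variable the anonymous function defines for the list name.
(\(d) filter(d, d[,4] > 3))() = this syntax is a little quirky because I prefer the native pipe operator (|>) to the magrittr pipe operator (%>%). It means I create another anonymous function (\(d)) that calls the current data d. I then filter d on the fourth column being greater than 3 (i.e. d[,4] > 3). If you use the magrittr pipe, this can be simplified to filter(.[,4] > 3). A better approach would be to use non-standard evaluation and avoid the anonymous function altogether, but I had trouble working out how {{}}, quo, enquo and !! behave with quoted column names.
pull(two) = finally, we keep only the values from the column named two.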
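The steps explained above can be traced one at a time on a single data frame. This is only a sketch under stated assumptions: toy, its column names and the value of name are invented, the first three columns are renamed by position rather than by the ...1, ...2, ...3 names, and the \(d) lambda syntax needs R 4.1 or later.

```r
library(dplyr)

# Hypothetical data frame: three leading columns plus one column per group
toy <- data.frame(
  id = c("1", "2", "3"),
  gene = c("A", "B", "C"),
  score = c(10, 20, 30),
  `grp-1` = c(5, 1, 4),
  `grp-2` = c(0, 9, 0),
  check.names = FALSE
)
name <- "grp-1"  # stands in for the list element's name

# Keep and rename the first three columns, plus the column called `name`
step1 <- select(toy, one = 1, two = 2, three = 3, all_of(name))
names(step1)

# Anonymous function applied directly: keep rows where column 4 exceeds 3
step2 <- (\(d) filter(d, d[, 4] > 3))(step1)

# Extract the remaining values of column `two` as a plain vector
result <- pull(step2, two)
result
```

Only rows 1 and 3 have a grp-1 value above 3, so result holds the gene names from those two rows.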
Edit 2: cleaned-up code.
I worked out the non-standard evaluation needed to clean up the odd syntax.
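map2(df_list, names(df_list),
     \(dat, name) {
       dat |>
         select(one = ...1,
                two = ...2,
                three = ...3,
                all_of(name)) |>
         # all_of() is meant for selection contexts such as select();
         # inside filter(), look the column up by its string name with .data
         filter(.data[[name]] > 3) |>
         pull(two)
     }
)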
#> $`CD8_C01-LEF1`
#> [1] "CCR7" "LEF1" "SELL"
#>
#> $`CD8_C02-GPR183`
#> [1] "IL7R" "S1PR1" "SORL1"
#>
#> $`CD8_C03-CX3CR1`
#> [1] "PLEK" "CX3CR1" "FGFBP2" "FCGR3A" "FCRL6" "KLRG1"
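As a variant sketch, purrr::imap() passes each element together with its name, so the names do not have to be supplied as a second argument. The data here is hypothetical (toy_list and the helper make_toy are invented for illustration), the columns are renamed by position, and the filter looks the column up by its string name through dplyr's .data pronoun.

```r
library(purrr)
library(dplyr)

# Hypothetical helper that builds a small data frame with one column per group
make_toy <- function(g1, g2) {
  data.frame(
    id = c("1", "2", "3"),
    gene = c("A", "B", "C"),
    score = c(10, 20, 30),
    `grp-1` = g1,
    `grp-2` = g2,
    check.names = FALSE
  )
}

# Named list of data frames; each name matches one of the group columns
toy_list <- list(
  `grp-1` = make_toy(c(5, 1, 4), c(0, 0, 0)),
  `grp-2` = make_toy(c(0, 0, 0), c(1, 7, 2))
)

# imap(x, f) calls f(element, name) for every element and keeps the list names
res <- imap(toy_list, \(dat, name) {
  dat |>
    select(one = 1, two = 2, three = 3, all_of(name)) |>
    filter(.data[[name]] > 3) |>
    pull(two)
})
res
```

The result is a named list with one character vector per data frame, the same shape as the output shown above.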