Data cleaning & subsetting in nested list

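I couldn't find any previous questions that address these steps for a nested list, and my own attempts haven't gotten me anywhere!

I have a nested list, df.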

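  1. I want to change the column names of the first 3 columns in all data.frames to c("one","two","three").
  2. In each data frame, keep the first 3 columns and the column that has the same name as that data frame in the list.
  3. Each data frame now has 4 columns. In each data frame, I want to keep the values of the second column where the value in the fourth column is greater than 3.
  4. Return a nested list containing the name of each data frame and the selected values from the second column (from step 3).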

A purrr/dplyr approach is preferred, but anything else is very much appreciated!

> dput(map_depth(df,1, head))
list(`CD8_C01-LEF1` = structure(list(...1 = c("1236", "6194", 
"51176", "6402", "6137", "1937"), ...2 = c("CCR7", "RPS6", "LEF1", 
"SELL", "RPL13", "EEF1G"), ...3 = c(448.275813024615, 114.565282822255, 
405.993571415472, 352.462886197845, 152.430598462657, 73.5226212775651
), `P-value*` = c(0, 2.35914832807463e-150, 0, 0, 1.03146807397557e-195, 
3.00681346250943e-98), `CD8_C01-LEF1` = c(6.3388353508401, 1.36075129906401, 
5.11667843995657, 5.22902495053118, 1.35703181746742, 1.72815687302818
), `CD8_C02-GPR183` = c(2.71993044636725, 0.755445092850178, 
2.26029822474036, 3.57732840656951, 0.757664532314421, 0.732003573596204
), `CD8_C03-CX3CR1` = c(-2.50016459757821, 0.0430813598361915, 
-1.47763877045973, -1.31104077043168, -0.118054173396857, -0.217984797372657
), `CD8_C04-GZMK` = c(-0.639352384551204, -0.304854019068466, 
-1.400271288872, -1.56965980479594, -0.128422617265835, -0.701864111617954
), `CD8_C05-CD6` = c(-2.35873754058284, -0.115888861319928, -2.08628173736428, 
-3.32630706764402, -0.177640817498698, -0.215754243123614), `CD8_C06-CD160` = c(-2.85558322130952, 
-0.29530343951866, -2.20232116143474, -3.274807762691, -0.440783845861116, 
-0.56207661416919), `CD8_C07-LAYN` = c(-2.75671138163062, -0.887003245107014, 
-2.40845402752497, -3.47698326675668, -1.03656381624963, -1.46468960616135
), `CD8_C08-SLC4A10` = c(-2.68199272253543, 0.0292368512820967, 
-2.1581654239029, -2.99895134853712, 0.0615744908900675, 0.192173783941343
)), row.names = c(NA, 6L), class = "data.frame"), `CD8_C02-GPR183` = structure(list(
    ...1 = c("3575", "4050", "1901", "6653", "1880", "10628"), 
    ...2 = c("IL7R", "LTB", "S1PR1", "SORL1", "GPR183", "TXNIP"
    ), ...3 = c(268.347035159053, 151.397715576146, 423.815475272167, 
    154.131971403975, 161.502687932662, 138.188069200824), `P-value*` = c(0, 
    1.63481853000449e-194, 0, 1.09616441981898e-197, 3.47999420200636e-206, 
    5.87606326954945e-179), `CD8_C01-LEF1` = c(2.25872137515665, 
    1.06433926285014, 2.06890434595653, 1.77222927526522, -2.32256398023726, 
    1.17445992511194), `CD8_C02-GPR183` = c(3.58534594694992, 
    2.33774626980998, 3.1044712936119, 3.00075778716827, 1.54874669286004, 
    2.11053414857411), `CD8_C03-CX3CR1` = c(-2.73122665345433, 
    -3.23251051546321, 2.76359001828421, 0.899851788567591, -3.4595583469893, 
    1.9924219816788), `CD8_C04-GZMK` = c(-1.20359289904198, -2.27859013855459, 
    -0.289843306560729, 0.0930099548084882, 0.293766916539111, 
    -1.05998934689132), `CD8_C05-CD6` = c(0.771026257612103, 
    -1.84446654315228, -1.92859019625536, -0.993527571866541, 
    -0.517242518264243, -1.05505195656161), `CD8_C06-CD160` = c(-1.26433565787961, 
    -3.62072638085859, -1.99838091859197, -2.66224984657089, 
    -3.84677781455005, -0.741084525734145), `CD8_C07-LAYN` = c(-4.85420539962432, 
    -3.79535857695107, -2.07599716553024, -2.41001692585172, 
    -3.66993376805675, -1.90910214659534), `CD8_C08-SLC4A10` = c(1.79563839118781, 
    0.431971358693421, 0.24665792844753, 0.820564247625701, -0.941462395796914, 
    0.224912511574641)), row.names = c(NA, 6L), class = "data.frame"), 
    `CD8_C03-CX3CR1` = structure(list(...1 = c("5341", "1524", 
    "83888", "2214", "343413", "10219"), ...2 = c("PLEK", "CX3CR1", 
    "FGFBP2", "FCGR3A", "FCRL6", "KLRG1"), ...3 = c(372.816216710618, 
    713.554708746553, 575.834099328186, 419.996034284325, 215.715234731706, 
    281.827177706662), `P-value*` = c("0", "0", "0", "0", "3.5450627744914998E-266", 
    "0"), `CD8_C01-LEF1` = c(-1.34745098111019, -0.39476162886016, 
    -0.248194028712413, -0.326944139043036, -0.833877751680806, 
    -0.822668603983214), `CD8_C02-GPR183` = c(0.50737446056126, 
    -0.495638146054913, -0.484905896571723, -0.125753818325312, 
    0.0263098770399738, 0.894340812937189), `CD8_C03-CX3CR1` = c(6.36825282208761, 
    5.38301238794739, 5.26196506464758, 5.6197563760267, 5.8532850807879, 
    5.36851683724817), `CD8_C04-GZMK` = c(1.44463895049283, -0.513803138075432, 
    -0.125340966094923, 0.2447981258131, 1.34537977512099, 2.10784813093189
    ), `CD8_C05-CD6` = c(-0.718776566594413, -0.795121492384525, 
    -0.681892196238474, -0.421395883952147, 0.0987360993173341, 
    -1.35585804120358), `CD8_C06-CD160` = c(-0.550964233191398, 
    -0.794078725052049, -0.707741972359531, -0.156207202527366, 
    2.24842830259497, -1.28977809817504), `CD8_C07-LAYN` = c(0.0641870785667258, 
    -0.785201010640904, -0.631939964779986, -0.340799120353511, 
    0.271892089522186, 0.236064375692484), `CD8_C08-SLC4A10` = c(1.40102283829925, 
    -0.158585496249154, -0.056110756095033, 0.00915832466806331, 
    -0.085141865592199, 3.78847417230501)), row.names = c(NA, 
    6L), class = "data.frame"))

One solution is:

res <- lapply(setNames(nm = names(df)), function(dfname) {
  dff <- df[[dfname]]

  # only renaming column 2 as columns 1 and 3 are not used later on
  colnames(dff)[2] <- "two" 

  # not 'keeping' the column with the same name as the dataframe, just using the dataframe straightaway   
  dff$two[dff[,dfname] > 3]
})

Note the setNames(...) call as the first argument to lapply. If you pass a named vector or list to lapply, it uses those names as the names of the elements of the list it returns.
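To illustrate the naming behaviour on its own, here is a minimal base R sketch. The list toy and its contents are made up for the example and are not part of the question's data:

```r
# Toy named list standing in for the list of data frames (hypothetical data)
toy <- list(a = 1:3, b = 4:6)

# setNames(nm = x) returns x named by itself: c(a = "a", b = "b")
nms <- setNames(nm = names(toy))

# lapply() carries the names of its input over to the result,
# so res is a list with elements named "a" and "b"
res <- lapply(nms, function(nm) sum(toy[[nm]]))
res
```

Without the setNames() wrapper, lapply(names(toy), ...) would return an unnamed list.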

Here is a purrr/dplyr solution (df_list here is the list called df in the question):

library(tidyverse)

map2(df_list, names(df_list), 
     \(dat, name) {
       dat |>
         select(one = ...1, 
                two = ...2, 
                three = ...3, 
                all_of(name)) |>
         (\(d) filter(d, d[,4] > 3))() |>
         pull(two)
         }
       )
#> $`CD8_C01-LEF1`
#> [1] "CCR7" "LEF1" "SELL"
#> 
#> $`CD8_C02-GPR183`
#> [1] "IL7R"  "S1PR1" "SORL1"
#> 
#> $`CD8_C03-CX3CR1`
#> [1] "PLEK"   "CX3CR1" "FGFBP2" "FCGR3A" "FCRL6"  "KLRG1"

Edit: explanation

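map2 = I use it here because you have a list of data frames, and map works well with lists. I use the "2" variant because you also want to select a column based on the list names.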
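As a side note, purrr also provides imap(), which passes each element together with its name, so the names do not have to be supplied as a separate argument. A minimal sketch on made-up data (toy_list and its columns are hypothetical, not the question's data):

```r
library(purrr)

# Hypothetical list of two small data frames, each holding a column
# named after its own list entry
toy_list <- list(
  a = data.frame(id = c("x", "y"), a = c(1, 5)),
  b = data.frame(id = c("x", "y"), b = c(4, 2))
)

# map2() with the names passed explicitly as the second input
via_map2 <- map2(toy_list, names(toy_list), \(dat, name) dat$id[dat[[name]] > 3])

# imap() is shorthand for the same call on a named list
via_imap <- imap(toy_list, \(dat, name) dat$id[dat[[name]] > 3])

identical(via_map2, via_imap)
```

Either form keeps the list names on the result, which is what step 4 of the question asks for.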

\(dat, name) = creates an anonymous function that takes the two inputs from map2, where I define the data as dat and the list name as name.

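select(one = ...1, two = ...2, three = ...3, all_of(name)) = here I select and rename the first three columns as requested in your question, and I also select the column that matches the list name with all_of(name). Remember that name is the variable defined in the anonymous function for the list name.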

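(\(d) filter(d, d[,4] > 3))() = this is slightly funky syntax because I prefer the native pipe operator (|>) over the magrittr pipe operator (%>%). It means I create another anonymous function (\(d)) that defines the current data as d. I then filter d on the fourth column being greater than 3 (i.e. d[,4] > 3). If you use the magrittr pipe, this can be simplified to filter(.[,4] > 3). A better approach would be to use non-standard evaluation to avoid the anonymous function altogether, but I had trouble figuring out {{ }}, quo, enquo, and !! with quoted column names.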
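The wrap-and-call pattern can be seen in isolation with a plain vector. This is a small base R sketch (it needs R 4.1 or later for |> and \(x)); vals is made-up data:

```r
# Hypothetical input vector
vals <- c(1, 5, 2, 7)

# The right-hand side of |> must be a function call, so the anonymous
# function is wrapped in parentheses and then called with ()
big <- vals |> (\(v) v[v > 3])()
big
```

The left-hand side becomes the first argument of the anonymous function, which is why d in the solution refers to the data frame coming out of select().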

pull(two) = finally, we pull out only the values from the column named two.

Edit 2: cleaned up the code.

I figured out the non-standard evaluation needed to clean up the odd syntax.

map2(df_list, names(df_list), 
     \(dat, name) {
       dat |>
         select(one = ...1, 
                two = ...2, 
                three = ...3, 
                all_of(name)) |>
         filter(!!sym(name) > 3) |>  # sym() takes the string directly; all_of() is for selection only
         pull(two)
         }
       )
#> $`CD8_C01-LEF1`
#> [1] "CCR7" "LEF1" "SELL"
#> 
#> $`CD8_C02-GPR183`
#> [1] "IL7R"  "S1PR1" "SORL1"
#> 
#> $`CD8_C03-CX3CR1`
#> [1] "PLEK"   "CX3CR1" "FGFBP2" "FCGR3A" "FCRL6"  "KLRG1"
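
Another way to filter on a column whose name is held in a string, besides !!sym(), is dplyr's .data pronoun. The sketch below uses a made-up data frame (toy and the column name "CD8-x" are hypothetical) and is meant only as an illustration of the idea:

```r
library(dplyr)

# Hypothetical data frame with a column whose name is not syntactic
toy <- data.frame(two = c("A", "B", "C"))
toy[["CD8-x"]] <- c(4, 1, 6)
name <- "CD8-x"

# .data[[name]] looks up the column named by the string stored in `name`
kept <- toy |>
  filter(.data[[name]] > 3) |>
  pull(two)
kept
```

Inside the map2() call above, the same filter(.data[[name]] > 3) line should be usable in place of the !!sym() version.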