How to put a loop on an sapply function to create a subset for multiple participants?

Question

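I would appreciate help with the following problem:

I have several huge log files (each with more than 1,000,000 entries) that contain some rows I am particularly interested in. I want to make a subset containing only these rows, but I want to write the result into a matrix that holds the information of more than one logfile/participant. So I wrote a small piece of code that 1. creates the subset and 2. runs in a loop, so that it does this not just for one log file but for all of them.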

  Result <- subset(df, df$columnOfInterest== "interestingCondition1" | df$columnOfInterest== "interestingCondition2" | df$columnOfInterest== "interestingCondition3")$columnOfInterest
  View(Result)

1   interestingCondition1
2   interestingCondition1
3   interestingCondition2
4   interestingCondition1
5   interestingCondition1
6   interestingCondition3
7   interestingCondition2
8   interestingCondition1
9   interestingCondition1
10  interestingCondition1

Embedded in the loop:

WrongResult <- matrix(data=NA,nrow=TrialNumber, ncol=length(ListOfFiles))
vpncount <- 1
for (v in ListOfFiles){

  df<- read.delim(v, header = TRUE, sep='\t')
  WrongResult[,vpncount] <- subset(df, df$columnOfInterest== "interestingCondition1" | df$columnOfInterest== "interestingCondition2" | df$columnOfInterest== "interestingCondition3")$columnOfInterest

vpncount <- vpncount+1

}

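When I run the code on a single log file I get the result I want, but when I run it through the loop it creates a matrix of the appropriate size and fills it with "random" numbers instead of the conditions I subset.

Does anyone know why this happens and how to fix it? Any help is much appreciated!

Edit:

I tried to create an example data frame. The first part of the code (up to and including the variable Result) works just as I want: it filters my data frame on the rows of my columnOfInterest and puts them into a new matrix. However, if I try to run it in a loop over multiple data frames, I run into errors: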

df <- data.frame(
  X = sample(1:10),
  columnOfInterest= sample(c("interestingCondition1", "interestingCondition2", "interestingCondition3", "NotinterestingCondition1"), 10, replace = TRUE)
)

View(df)

Result <- subset(df, df$columnOfInterest== "interestingCondition1" | df$columnOfInterest== "interestingCondition2" | df$columnOfInterest== "interestingCondition3")$columnOfInterest
View(Result)

WrongResult <- matrix(data=NA,nrow=280, ncol=20)
vpncount <- 1
for (v in 1:20){

  df<- read.delim(v, header = TRUE, sep='\t')
  WrongResult[,vpncount] <- subset(df, df$columnOfInterest== "interestingCondition1" | df$columnOfInterest== "interestingCondition2" | df$columnOfInterest== "interestingCondition3")$columnOfInterest

  vpncount <- vpncount+1

}

View(WrongResult)

Answer 1

I don't remember how to do this with a data.frame, so I'll use data.table instead. You may need to install the data.table package if you don't have it: install.packages("data.table")

library(data.table)
dt <- data.table(df)

Then you can rewrite your code as follows

subset..table <- function(dt){
    dt[columnOfInterest %in% c("interestingCondition1",
                               "interestingCondition2",
                               "interestingCondition3"),columnOfInterest]
}


myfun <- function(x){
    ## x: character string giving a file name
    ## Purpose: read the file and subset it

    dt <- fread(x, header=TRUE, sep="\t")
    subset..table(dt)

}

res..list <- lapply(ListOfFiles, myfun)

Edit

For example, using your sample data:

df <- data.frame(
  X = sample(1:10),
  columnOfInterest= sample(c("interestingCondition1",
    "interestingCondition2", "interestingCondition3", 
    "NotinterestingCondition1"), 10, replace = TRUE))


dt <- data.table(df)
subset..table(dt)

produces

#[1] "interestingCondition2" "interestingCondition3" "interestingCondition1"
#[4] "interestingCondition2" "interestingCondition1" "interestingCondition2"
#[7] "interestingCondition3" "interestingCondition1" "interestingCondition3"

If you are happy with the function subset..table, then you can simply use the function myfun to get what you want. The function fread gives you a data.table automatically.

Answer 2

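In tidyverse land, when you are dealing with a single data frame, you want to filter() and then select() your original data and, for convenience, add the file name using mutate(). A good way to filter when there are several possible values is %in%. So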

library(tidyverse)

process_1_df <- function(df, id, condition)
    select(df, columnOfInterest) %>%                 # only interesting column
        filter(columnOfInterest %in% condition) %>%  # specific rows
        mutate(id = id)                              # add identifier

condition <- paste0("interestingCondition", 1:3)
process_1_df(df, "id", condition)
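As a small aside on the %in% filter used above, here is a base R sketch comparing it with the chained == tests from the question. The vector x and the name wanted are made up for illustration; the only real difference shows up when the column contains NA.

```r
# Hypothetical column values, including one missing entry
x <- c("interestingCondition1", "NotinterestingCondition1",
       "interestingCondition3", NA)
wanted <- paste0("interestingCondition", 1:3)

# Chained comparisons, as in the question: NA propagates
by_or <- x == wanted[1] | x == wanted[2] | x == wanted[3]

# %in% never returns NA: a missing value is simply "not in the set"
by_in <- x %in% wanted

by_or  # TRUE FALSE TRUE NA
by_in  # TRUE FALSE TRUE FALSE
```

subset() and filter() both treat an NA condition as FALSE, so the two forms select the same rows there. With plain bracket indexing, the NA from the chained version would produce an NA row.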

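id is an identifier: if the data.frame came from the file 'foo.txt', use "foo.txt" as the id. The original question tried to represent data from multiple files as a matrix, but that assumes the same number of interesting rows is selected from every file. The strategy here is to create a data frame that holds both the file the interesting condition came from and the value of the interesting condition. This data frame is useful when processing multiple files...

This works on the example data set: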

> condition <- paste0("interestingCondition", 1:3)
> process_1_df(df, "id", condition)
       columnOfInterest id
1 interestingCondition2 id
2 interestingCondition2 id
3 interestingCondition3 id
4 interestingCondition1 id
5 interestingCondition3 id
6 interestingCondition1 id

You can extend it to process a file

process_1_file <- function(file_name, condition)
    read_csv(file_name) %>%                   # better: input only columnOfInterest
        process_1_df(file_name, condition)

As suggested by @DJJ, a data.table implementation of process_1_file() can be very compact and efficient: fread(file_name)[columnOfInterest %in% condition, columnOfInterest]

To process multiple files, use the purrr package

library(purrr)
process_files <- function(file_names, condition)
    map(file_names, process_1_file, condition) %>%
        bind_rows()

dir(pattern="\\.csv$") %>% process_files(condition)   # pattern is a regular expression, not a glob

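The end result is a single data frame with one column holding the interesting condition and another indicating which log file it came from. This 'long' format data frame can now be processed / summarized as needed.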
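As one possible example of such a summary, here is a minimal sketch that counts the interesting conditions per file. The data frame long is a hand-made stand-in for the output of process_files(), and base R's table() is used only to keep the sketch free of dependencies; dplyr's count(long, id, columnOfInterest) would be the tidyverse counterpart.

```r
# Hypothetical 'long' result: one row per interesting line, tagged with its file
long <- data.frame(
  columnOfInterest = c("interestingCondition1", "interestingCondition2",
                       "interestingCondition1", "interestingCondition3",
                       "interestingCondition1"),
  id = c("a.csv", "a.csv", "a.csv", "b.csv", "b.csv"),
  stringsAsFactors = FALSE
)

# Cross-tabulate: rows are files, columns are conditions
counts <- table(file = long$id, condition = long$columnOfInterest)
counts
```

Because each row carries its own file identifier, files with different numbers of matching rows cause no trouble here.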

Answer 3

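Does anyone know why that happens?

Your loop is... not working. The reasons are somewhat involved, but I have made a working example in base R using a simple loop (no *apply functions), which I hope you can follow and which I hope represents your problem adequately.

Learn to walk before you run. Learn basic loops before learning how to write them more concisely with apply(), lapply() and friends. Learn base R before diving into non-standard evaluation (data.table, tidyverse, purrr, etc.).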
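One likely ingredient, sketched here under an assumption the question does not confirm: if columnOfInterest was read as a factor (the default for read.delim() before R 4.0.0), then assigning it into a matrix column keeps only the integer level codes, which would look like "random" numbers. The names coi, m_codes and m_labels are made up for illustration.

```r
# Assumption: the column arrived as a factor (stringsAsFactors = TRUE)
coi <- factor(c("ic2", "ic1", "ic3", "ic1"))

# Assigning a factor into a matrix drops its levels; integer codes remain
m_codes <- matrix(NA, nrow = 4, ncol = 1)
m_codes[, 1] <- coi
m_codes        # 2 1 3 1, not the condition names

# Converting to character first keeps the labels
m_labels <- matrix(NA, nrow = 4, ncol = 1)
m_labels[, 1] <- as.character(coi)
m_labels       # "ic2" "ic1" "ic3" "ic1"
```

Reading the file with stringsAsFactors=FALSE, as in the loop further down, avoids the factor in the first place.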

First we will create some data frames and write them to files

owd <- getwd()
dir.create("sotest")
setwd("sotest")

set.seed(1)

flist <- c("dtf1.txt", "dtf2.txt", "dtf3.txt")

for (i in 1:length(flist)) {
    dtf <- data.frame(
      X=sample(1:10),
      coi=sample(c("ic1", "ic2", "ic3", "nic1"), 10, replace=TRUE)
    )
    write.table(dtf, flist[i], row.names=FALSE, sep="\t")
}

After running this, you should have a folder called "sotest" containing three tab-delimited txt files.

Then we get the list of available files and loop over it.

flist <- list.files(pattern=".txt")
WrongResult <- list()
interesting <- c("ic1", "ic2", "ic3")

for (v in 1:length(flist)) {

    dtf <- read.delim(flist[v], header=TRUE, sep="\t", stringsAsFactors=FALSE)
    WrongResult[[v]] <- dtf[dtf$coi %in% interesting, "coi"]

}

WrongResult

setwd(owd)
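A side note on list.files(): its pattern argument is a regular expression, so ".txt" means "any character followed by txt". That is fine in the clean sotest folder, but a stricter pattern is safer in a folder with mixed files. A minimal sketch, using a throwaway directory and made-up file names:

```r
# Throwaway directory with one real .txt file and one look-alike
tmp <- tempfile("pattern_demo")
dir.create(tmp)
invisible(file.create(file.path(tmp, c("dtf1.txt", "notes_txt.csv"))))

loose  <- list.files(tmp, pattern = ".txt")      # "." matches any character
strict <- list.files(tmp, pattern = "\\.txt$")   # literal dot, end of name

loose   # both files
strict  # only "dtf1.txt"
```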

I store the output in a list rather than a matrix, because the objects produced in each iteration of the loop are not all the same length.
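If a matrix with one column per file is still wanted afterwards, one option is to pad the shorter vectors with NA. This is only a sketch under that assumption; res is a hand-made stand-in for the list built by the loop, and padded is a made-up name.

```r
# Hypothetical stand-in for the loop's result: one vector per file,
# each with a different number of interesting rows
res <- list(c("ic1", "ic2", "ic3"), "ic2", c("ic3", "ic1"))

# The longest element decides the number of rows
n <- max(lengths(res))

# Indexing past the end of a vector yields NA, so x[seq_len(n)] pads
# each element to length n; sapply() then binds them as columns
padded <- sapply(res, function(x) x[seq_len(n)])
padded
```

sapply() returns a matrix here because every padded element has the same length greater than one. The NA cells mark files that had fewer matching rows.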
