In R, how do I classify each row of a data frame based on the bins its values fall into?

Question

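In R, I want to classify each row of a data frame into one of 2 groups (classes) by binning its values and using the number of values in each bin (the counts), with if-else logic.

Inside an R for loop, I use R's cut and split commands to bin the values row by row.
The bins (ranges) are: 1..9, 10..19, 20..29, 30..39, 40..49.
If a row contains one pair of values that fall in the same bin (range), say 10..19, it should be classified as "P". If it contains 2 pairs that fall in 2 different bins (ranges), it should be classified as "PP".
I then created 2 new variables named p and pp using hard-coded conditions/rules. Each is TRUE or FALSE, depending on whether the current row satisfies the rule.
Finally, I use p and pp as conditions in an if-else statement to assign each row to class P (the first row) or class PP (the second row).

First, I created a data frame x: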

n1 <- c(1, 7); n2 <- c(2, 11); n3 <- c(10, 14); n4 <- c(23, 32); n5 <- c(37, 37); n6 <- c(45, 41)
x <- data.frame(n1, n2, n3, n4, n5, n6)
x
  n1 n2 n3 n4 n5 n6
1  1  2 10 23 37 45
2  7 11 14 32 37 41

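The first row should be classified as "P", because it has one pair of values (1, 2) that fall in the same bin, 1..9.
The second row should be classified as "PP", because it has 2 pairs of values (11, 14 and 32, 37) that fall in 2 bins: 10..19 and 30..39, respectively.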
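As a quick check of the two expected classifications, the bin counts for each row can be tabulated directly. This is only a sketch: it reuses the breaks from the question, and the names breaks, row1, row2, counts1 and counts2 are introduced here for illustration.

```r
# Breaks from the question: (0,9], (9,19], (19,29], (29,39], (39,49]
breaks <- c(0, 9, 19, 29, 39, 49)

row1 <- c(1, 2, 10, 23, 37, 45)   # first row of x
row2 <- c(7, 11, 14, 32, 37, 41)  # second row of x

# Number of values that land in each bin
counts1 <- table(cut(row1, breaks))  # 2 1 1 1 1: one pair, in (0,9]
counts2 <- table(cut(row2, breaks))  # 1 2 0 2 1: two pairs, in (9,19] and (29,39]
```

One bin holding two values corresponds to "P", and two such bins correspond to "PP".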

So, after creating the data frame x, I wrote a for loop:

for(i in nrow(x)){

# binning the data:
  bins <- split(as.numeric(x[i, ]), cut(as.numeric(x[i, ]), c(0, 9, 19, 29, 39, 49)))
  # creating the rule for p (1 pair of numbers falling in the same range)
  p <- (sum(lengths(bins) == 2) == 1 & sum(lengths(bins) == 1) == 4)
  # creating the rule for pp (2 different pairs, each has 2 numbers falling in the same range)
  pp <- (sum(lengths(bins) == 2) == 2 & sum(lengths(bins) == 1) == 2 & sum(lengths(bins) == 0) == 1)

  if(p){
    x$types <- "P"
  } else if(pp){
    x$types <- "PP"
  } else{
    stop("error")
  }
  }

print(x)

I want to create a new column named types that contains the class P or PP:

  n1 n2 n3 n4 n5 n6 types
1  1  2 10 23 37 45 P
2  7 11 14 32 37 41 PP

But the code returns only PP:

  n1 n2 n3 n4 n5 n6 types
1  1  2 10 23 37 45 PP
2  7 11 14 32 37 41 PP

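I think this happens because the loop runs over the rows twice. But if it ran only once, all rows would be classified as "P" instead of "PP". I hope this is simple; I just haven't figured it out yet.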

Answer 1

Not pretty, but it works:

x['types'] <- apply(x, 1, function(a) {stringr::str_replace_all(paste(+(table(floor(a/10)) > 1), collapse=""), c('1'='P','0'=''))})

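Unpacking

floor(a/10) converts the values to bin indices
table(...) > 1 counts the values in each bin and returns TRUE for bins with more than one value
+(...) converts logical TRUE/FALSE to 1/0
paste(..., collapse="") joins the vector of strings into a single string with no separator
str_replace_all(..., c('1'='P', ...)) replaces all matching substrings, using patterns defined as 'old'='new'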
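The steps above can be traced on the second row of x. This is a sketch for illustration only: the intermediate variable names are made up here, and base R's gsub stands in for stringr::str_replace_all so the snippet runs without extra packages.

```r
a <- c(7, 11, 14, 32, 37, 41)              # second row of x

bin_index <- floor(a / 10)                 # 0 1 1 3 3 4
counts    <- table(bin_index)              # bins 0, 1, 3, 4 hold 1, 2, 2, 1 values
flags     <- +(counts > 1)                 # 0 1 1 0
collapsed <- paste(flags, collapse = "")   # "0110"

# Base-R stand-in for the str_replace_all() step: "1" becomes "P", "0" is dropped
label <- gsub("0", "", gsub("1", "P", collapsed, fixed = TRUE), fixed = TRUE)
label                                      # "PP"
```

Note that table() only lists bins that actually occur, so the empty bin 20..29 does not appear in counts.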

Result

  n1 n2 n3 n4 n5 n6 types
1  1  2 10 23 37 45     P
2  7 11 14 32 37 41    PP

Answer 2

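The mistake in your for loop is that you don't use i when assigning types. x$types <- "P" sets the entire types column to "P". x$types <- "PP" sets the entire types column to "PP". So whatever the last result is, that becomes the value of the whole column.

Also, it is dangerous to use the whole row x[i, ] once the types column has been added. Presumably you don't want to try to convert the "P" and "PP" values in types to numbers and bin them. I suggest making types a separate vector and only adding it as a column after the loop. Before the loop: types <- character(nrow(x)). Inside the loop: types[i] <- instead of x$types <-. After the loop: x$types <- types.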
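The difference between the two kinds of assignment can be seen in a small sketch. It uses a cut-down two-column version of x, and whole_column is a name introduced here for illustration.

```r
x <- data.frame(n1 = c(1, 7), n2 = c(2, 11))

# Assigning a single string to x$types recycles it to every row
x$types <- "P"
x$types <- "PP"            # overwrites every row again
whole_column <- x$types    # "PP" "PP"

# Indexing a separate vector changes one element at a time
types <- character(nrow(x))
types[1] <- "P"
types[2] <- "PP"
types                      # "P" "PP"
```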

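You are also making the classic mistake of writing for (i in nrow(x)) when you mean for (i in 1:nrow(x)). nrow(x) is the single number 2, so the loop body runs only once, with i = 2.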
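A minimal sketch of which values of i each form visits, using a cut-down version of x. The names visited_wrong and visited_right are made up for this illustration, and seq_len(nrow(x)) is shown as an alternative to 1:nrow(x) that is not part of the original answer.

```r
x <- data.frame(n1 = c(1, 7), n2 = c(2, 11))

# for (i in nrow(x)) iterates over the single value nrow(x)
visited_wrong <- integer(0)
for (i in nrow(x)) visited_wrong <- c(visited_wrong, i)
visited_wrong    # 2

# seq_len(nrow(x)) yields 1, 2, ..., nrow(x)
visited_right <- integer(0)
for (i in seq_len(nrow(x))) visited_right <- c(visited_right, i)
visited_right    # 1 2
```

seq_len(nrow(x)) gives the same sequence as 1:nrow(x) here, but it also behaves sensibly for a data frame with zero rows, where 1:nrow(x) would count down from 1 to 0.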

Fixing all of that:

n1 <- c(1, 7); n2 <- c(2, 11); n3 <- c(10, 14); n4 <- c(23, 32); n5 <- c(37, 37); n6 <- c(45, 41)
x <- data.frame(n1, n2, n3, n4, n5, n6)

types <- character(nrow(x))

for(i in 1:nrow(x)){
  # binning the data:
  bins <- split(as.numeric(x[i, ]), cut(as.numeric(x[i, ]), c(0, 9, 19, 29, 39, 49)))
  # creating the rule for p (1 pair of numbers falling in the same range)
  p <- (sum(lengths(bins) == 2) == 1 & sum(lengths(bins) == 1) == 4)
  # creating the rule for pp (2 different pairs, each has 2 numbers falling in the same range)
  pp <- (sum(lengths(bins) == 2) == 2 & sum(lengths(bins) == 1) == 2 & sum(lengths(bins) == 0) == 1)

  if(p){
    types[i] <- "P"
  } else if(pp){
    types[i] <- "PP"
  } else{
    stop("error")
  }
}

x$types <- types
x
#   n1 n2 n3 n4 n5 n6 types
# 1  1  2 10 23 37 45     P
# 2  7 11 14 32 37 41    PP
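The same rules can also be written without an explicit loop. The following is a sketch, not part of the original answer: classify_row is a hypothetical helper, it assumes every row holds six values that all fall inside the breaks, and it returns NA instead of calling stop() when a row is neither "P" nor "PP".

```r
x <- data.frame(n1 = c(1, 7), n2 = c(2, 11), n3 = c(10, 14),
                n4 = c(23, 32), n5 = c(37, 37), n6 = c(45, 41))

# classify_row() is a hypothetical helper introduced for this sketch
classify_row <- function(values, breaks = c(0, 9, 19, 29, 39, 49)) {
  counts  <- table(cut(values, breaks))  # number of values in each bin
  n_pairs <- sum(counts == 2)            # bins holding exactly two values
  if (any(counts > 2) || !n_pairs %in% c(1, 2)) {
    return(NA_character_)                # neither "P" nor "PP"
  }
  strrep("P", n_pairs)                   # 1 pair -> "P", 2 pairs -> "PP"
}

# Classify before adding the column, so only n1..n6 are binned
types <- unname(apply(as.matrix(x), 1, classify_row))
x$types <- types
x$types                                  # "P" "PP"
```

Because types is computed before the column is attached, the character column never gets mixed into the numeric values being binned.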
