Rearrange rows and calculate mode in R by creating a new variable

Question

I have a dataset with two columns, "ID" and "CODCOM", and about 1 million rows. The first column, "ID", contains repeated values.

ID	CODCOM
10000	12
101010	14
201020	11
201020	11
201020	12
324032	43
324032	43
324032	43
405044	51
323032	21

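I want to group the rows by repeated "ID" value, calculate the mode of "CODCOM" within each group, and then create a new column that holds the corresponding mode. Like this: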

ID	CODCOM	NEW_COL
10000	12	12
101010	14	14
201020	11	11
201020	11	11
201020	12	11
324032	43	43
324032	43	43
324032	43	43
405044	51	51
323032	21	21

How can I do this in a simple way?

Thank you very much for any help.

Answer 1

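A dplyr approach, in which I join the data to a version of itself that holds the most common CODCOM value for each ID (or the first one to appear if there are ties).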

library(dplyr)
df1 %>%
  left_join(
    df1 %>%
      group_by(ID) %>%
      count(mode = CODCOM, sort = TRUE) %>%
      slice(1),
    by = "ID"
  )


       ID CODCOM mode n
1   10000     12   12 1
2  101010     14   14 1
3  201020     11   11 2
4  201020     11   11 2
5  201020     12   11 2
6  324032     43   43 3
7  324032     43   43 3
8  324032     43   43 3
9  405044     51   51 1
10 323032     21   21 1
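
If only the NEW_COL column from the question is wanted, the join can be narrowed so that the helper count is dropped. This is a minimal sketch, assuming the question's data is stored as df1; the intermediate name modes is hypothetical.

```r
library(dplyr)

# Sample data copied from the question (assumed to be stored as `df1`)
df1 <- data.frame(
  ID     = c(10000, 101010, 201020, 201020, 201020,
             324032, 324032, 324032, 405044, 323032),
  CODCOM = c(12, 14, 11, 11, 12, 43, 43, 43, 51, 21)
)

# `modes` is a hypothetical name for the per-ID lookup table
modes <- df1 %>%
  count(ID, NEW_COL = CODCOM, sort = TRUE) %>%  # one row per (ID, value), most frequent first
  group_by(ID) %>%
  slice(1) %>%                                  # keep the top value for each ID
  ungroup() %>%
  select(ID, NEW_COL)                           # drop the count column `n`

# left_join() keeps every row of df1 in its original order
result <- left_join(df1, modes, by = "ID")
```

The result has the original ten rows plus a single NEW_COL column.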

Answer 2

Please find below one possible solution using the data.table package:

REPREX

Code

library(data.table)


# Function to compute mode
mode_compute <- function(x) {
  uniqx <- unique(x)
  uniqx[which.max(tabulate(match(x, uniqx)))]
}

# Compute mode by ID
DT[ , MODE := mode_compute(CODCOM), by = ID]

Output

DT
#>         ID CODCOM MODE
#>  1:  10000     12   12
#>  2: 101010     14   14
#>  3: 201020     11   11
#>  4: 201020     11   11
#>  5: 201020     12   11
#>  6: 324032     43   43
#>  7: 324032     43   43
#>  8: 324032     43   43
#>  9: 405044     51   51
#> 10: 323032     21   21

Data:

# Data
DT <- data.table(ID = c("10000", "101010", "201020", "201020", "201020",
                 "324032", "324032", "324032", "405044", "323032"),
                 CODCOM = c(12, 14, 11, 11, 12, 43, 43, 43, 51, 21))
DT
#>         ID CODCOM
#>  1:  10000     12
#>  2: 101010     14
#>  3: 201020     11
#>  4: 201020     11
#>  5: 201020     12
#>  6: 324032     43
#>  7: 324032     43
#>  8: 324032     43
#>  9: 405044     51
#> 10: 323032     21

Created on 2021-10-11 by the reprex package (v0.3.0)
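
One detail worth knowing about mode_compute() is how it behaves when two values are equally frequent. The sketch below repeats the function from the answer above and applies it to a hypothetical tied input that is not part of the question's data.

```r
# mode_compute() as defined in the answer above
mode_compute <- function(x) {
  uniqx <- unique(x)
  uniqx[which.max(tabulate(match(x, uniqx)))]
}

# Hypothetical input in which 12 and 11 both occur twice
tied <- c(12, 11, 11, 12)

# unique() keeps values in order of first appearance and which.max()
# returns the first maximum, so the value seen first wins the tie
tie_result <- mode_compute(tied)

# Because counting is done on positions in unique(x), character codes work too
chr_result <- mode_compute(c("b", "a", "a"))
```

Here tie_result is 12 and chr_result is "a". If a different tie rule is needed (for example, the smallest value), the input can be sorted before it is passed in.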

Answer 3

If I understand correctly, we can group_by ID and then summarise with a mode function:

If you don't want to summarise, you can use mutate instead (which keeps all rows)!

library(dplyr)

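# Note: which.max(tabulate()) equals the mode only when the codes are positive integers.
# Defining `mode` here also masks base R's mode() for the rest of the session.
mode <- function(codes){
  which.max(tabulate(codes))
}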

df %>% 
  as_tibble() %>% 
  group_by(ID) %>% 
  summarise(NEW_COL = mode(CODCOM))

      ID NEW_COL
   <int>   <int>
1  10000      12
2 101010      14
3 201020      11
4 323032      21
5 324032      43
6 405044      51
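
The mutate() variant mentioned above can be sketched as follows. This is a minimal sketch, assuming the question's data is stored in a data frame named df. The helper name mode_value is hypothetical, chosen here so that base R's mode() is not masked, and it uses a unique()/match() count so that it doesn't rely on the codes being positive integers.

```r
library(dplyr)

# Sample data copied from the question (assumed to be stored as `df`)
df <- data.frame(
  ID     = c(10000, 101010, 201020, 201020, 201020,
             324032, 324032, 324032, 405044, 323032),
  CODCOM = c(12, 14, 11, 11, 12, 43, 43, 43, 51, 21)
)

# Hypothetical helper: most frequent value; the first one seen wins a tie
mode_value <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

result <- df %>%
  group_by(ID) %>%
  mutate(NEW_COL = mode_value(CODCOM)) %>%  # one value per group, recycled to every row
  ungroup()
```

Unlike summarise(), this keeps all ten rows in their original order and adds NEW_COL alongside CODCOM.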

Answer 4

Base R solutions:

# Option 1 using TarJae's mode function:
# Apply the function groupwise, store result as vector:
# NEW_COL => integer vector
df$NEW_COL <- with(
  df,
  ave(
    CODCOM,
    ID,
    FUN = function(x){
      which.max(tabulate(x))
    }
  )
)

# Option 2:
# Function to calculate the mode of a vector: 
# mode_statistic => function()
mode_statistic <- function(x){
  # Calculate the mode: res => vector
  res <- names(
    head(
      sort(
        table(
          x
        ),
        decreasing = TRUE
      ),
      1
    )
  )
  # Explicitly define returned object: character vector => env
  return(res)
}

# Apply the function groupwise, store result as vector:
# NEW_COL => integer vector
df$NEW_COL <- with(
  df,
  ave(
    CODCOM,
    ID,
    FUN = function(x){
      as.integer(
        mode_statistic(x)
      )
    }
  )
)
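
Option 1 relies on which.max(tabulate(x)), which returns the mode only when the codes are positive integers, since tabulate() counts the bins 1 to max(x) and ignores everything else. The sketch below uses hypothetical data (not from the question) containing a code of 0 to show the difference, and pairs ave() with a unique()/match() count as one possible alternative.

```r
# Hypothetical data: the codes for ID 1 include 0, which tabulate() ignores
df <- data.frame(
  ID     = c(1, 1, 1, 2, 2),
  CODCOM = c(0, 0, 5, 7, 7)
)

# Option 1 as written: the two zeros are dropped, so ID 1 gets 5 instead of 0
tabulate_mode <- ave(df$CODCOM, df$ID, FUN = function(x) {
  which.max(tabulate(x))
})

# Counting positions in unique(x) instead works for any values
match_mode <- ave(df$CODCOM, df$ID, FUN = function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
})
```

For ID 1, tabulate_mode is 5 while match_mode is 0. Both agree on 7 for ID 2. With the question's data, where every code is a positive integer, the two approaches give the same result.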
