How to match the minimum difference within two criteria and return the index in R

Question

This is my df.

index   firmcode    year    indcode     ROA
  0      a         2006      03         0.1
  1      b         2006      03         0.2
  2      c         2006      03         0.4
  3      d         2006      03         0.7   
  4      e         2006      07         0.3
  5      f         2006      07         0.8
  6      g         2006      07         1.1
  7      h         2006      07         2.1
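
For anyone who wants to reproduce this, the table above can be recreated roughly as follows. This is only a sketch: the column types are assumptions, and indcode is stored as character here to keep the leading zero (the printed outputs in the answers show it read as an integer instead).

```r
# Sketch: recreate the example data (column types are assumptions)
df <- data.frame(
  index    = 0:7,
  firmcode = letters[1:8],                      # "a" .. "h"
  year     = 2006L,                             # recycled to all 8 rows
  indcode  = rep(c("03", "07"), each = 4),
  ROA      = c(0.1, 0.2, 0.4, 0.7, 0.3, 0.8, 1.1, 2.1),
  stringsAsFactors = FALSE
)
```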

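I would like it to look like this: each firm is matched to the firm with the closest ROA (same year and same indcode, excluding the firm itself).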

index   firmcode    year    indcode     ROA   diff_min_firmcode
  0      a         2006      03         0.1         b  
  1      b         2006      03         0.2         a
  2      c         2006      03         0.4         b          
  3      d         2006      03         0.7         c
  4      e         2006      07         0.3         f 
  5      f         2006      07         0.8         g 
  6      g         2006      07         1.1         f
  7      h         2006      07         2.1         g

How can I get the df['diff_min_firmcode'] column?

Answer 1

What should happen in the case of a tie?

You can try this approach:

fun <- function(a, b) {
  # For each row, take the first firm with the smallest non-zero difference
  apply(abs(outer(a, a, `-`)), 1, function(x) b[x == min(x[x != 0])][1])
}
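
On the tie question: a variant that may be easier to reason about masks the diagonal instead of filtering on x != 0, so two different firms with exactly the same ROA can still match each other, and which.min() picks the first minimum when there is a tie. This is only a sketch. The name nearest_firm and the "first match wins" tie rule are my assumptions, not something the question specifies, and it assumes ROA has no missing values.

```r
# Hypothetical helper: for each element of `roa`, return the `firm`
# whose ROA is closest, excluding the element itself.
# Assumes `roa` contains no NA values.
nearest_firm <- function(roa, firm) {
  # A group with fewer than two firms has nothing to match against
  if (length(roa) < 2) return(rep(NA_character_, length(roa)))
  d <- abs(outer(roa, roa, `-`))  # pairwise absolute differences
  diag(d) <- Inf                  # a firm never matches itself
  # which.min() returns the first minimum, so ties go to the earliest row
  as.character(firm)[apply(d, 1, which.min)]
}

nearest_firm(c(0.1, 0.2, 0.4, 0.7), c("a", "b", "c", "d"))
# "b" "a" "b" "c"

# "x" and "y" share the same ROA; "z" is equally far from both
nearest_firm(c(0.5, 0.5, 0.9), c("x", "y", "z"))
# "y" "x" "x"
```

It takes the same two arguments as fun, so it should work as a drop-in replacement inside a grouped mutate().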

If you have multiple years and want to match only within each year and indcode, you can do this:

library(dplyr)

df %>%
  group_by(year, indcode) %>%
  mutate(diff_min_firmcode = fun(ROA, firmcode)) %>%
  ungroup

#  index firmcode  year indcode   ROA diff_min_firmcode
#  <int> <chr>    <int>   <int> <dbl> <chr>            
#1     0 a         2006       3   0.1 b                
#2     1 b         2006       3   0.2 a                
#3     2 c         2006       3   0.4 b                
#4     3 d         2006       3   0.7 c                
#5     4 e         2006       7   0.3 f                
#6     5 f         2006       7   0.8 g                
#7     6 g         2006       7   1.1 f                
#8     7 h         2006       7   2.1 g

Answer 2

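Here is another approach, using a full_join of the data.frame with itself. After filter removes the rows where a firm is paired with itself, you can find the firm with the minimum ROA difference among the others in the same year and indcode group.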

library(dplyr)

full_join(df, df, by = c("year", "indcode")) %>%
  filter(index.x != index.y) %>%
  group_by(firmcode.x, year, indcode) %>%
  mutate(diff_min_firmcode = firmcode.y[which.min(abs(ROA.x - ROA.y))]) %>%
  distinct(firmcode.x, .keep_all = TRUE)

Output

  index.x firmcode.x  year indcode ROA.x index.y firmcode.y ROA.y diff_min_firmcode
    <int> <chr>      <int>   <int> <dbl>   <int> <chr>      <dbl> <chr>            
1       0 a           2006       3   0.1       1 b            0.2 b                
2       1 b           2006       3   0.2       0 a            0.1 a                
3       2 c           2006       3   0.4       0 a            0.1 b                
4       3 d           2006       3   0.7       0 a            0.1 c                
5       4 e           2006       7   0.3       5 f            0.8 f                
6       5 f           2006       7   0.8       4 e            0.3 g                
7       6 g           2006       7   1.1       4 e            0.3 f                
8       7 h           2006       7   2.1       4 e            0.3 g
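
Note that the output above still carries index.y, firmcode.y and ROA.y from whichever pairing row happened to come first, which can be confusing. Below is a sketch of a tidier variant of the same self-join idea, assuming dplyr 1.0.0 or later for slice_min(). Newer dplyr versions may also warn about a many-to-many join, which is expected here. The ".other" suffix and the result name are just illustrative choices.

```r
library(dplyr)

# Example data from the question (indcode kept as character; an assumption)
df <- data.frame(
  index = 0:7, firmcode = letters[1:8], year = 2006L,
  indcode = rep(c("03", "07"), each = 4),
  ROA = c(0.1, 0.2, 0.4, 0.7, 0.3, 0.8, 1.1, 2.1),
  stringsAsFactors = FALSE
)

result <- inner_join(df, df, by = c("year", "indcode"),
                     suffix = c("", ".other")) %>%
  filter(firmcode != firmcode.other) %>%   # drop self-pairs
  group_by(firmcode, year, indcode) %>%
  # keep only the closest candidate; with_ties = FALSE keeps one row on ties
  slice_min(abs(ROA - ROA.other), n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(index, firmcode, year, indcode, ROA,
         diff_min_firmcode = firmcode.other)
```

As with the filter step above, a firm that is alone in its year and indcode group has no candidate row and drops out of the result.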

Answer 3

Here is another approach you could use for this problem:

library(dplyr)
library(tidyr)
library(purrr)

df %>%
  group_by(year, indcode) %>%
  mutate(output = map(ROA, ~ abs(ROA - .x))) %>%
  unnest_wider(col = output) %>%
  rowwise() %>%
  mutate(inds = which(c_across(contains("...")) == 
                        min(c_across(contains("..."))[c_across(contains("...")) != 0]))[1]) %>%
  select(!contains("...")) %>%
  group_by(year, indcode) %>%
  mutate(diff_min = map_chr(inds, ~ firmcode[.x]))

# A tibble: 8 x 7
# Groups:   year, indcode [2]
  index firmcode  year indcode   ROA  inds diff_min
  <int> <chr>    <int>   <int> <dbl> <int> <chr>   
1     0 a         2006       3   0.1     2 b       
2     1 b         2006       3   0.2     1 a       
3     2 c         2006       3   0.4     2 b       
4     3 d         2006       3   0.7     3 c       
5     4 e         2006       7   0.3     2 f       
6     5 f         2006       7   0.8     3 g       
7     6 g         2006       7   1.1     2 f       
8     7 h         2006       7   2.1     3 g
