Dataframe结构问题及新Dataframe的计算

Question

所以我得到了这个数据框（数据框 1）：

# A tibble: 6 x 27
# Groups:   authority_dic, Full.Name [6]
  authority_dic Full.Name     Entity `4_2020` `3_2020` `7_2020` `12_2019` `8_2020` `5_2019` `1_2019` `4_2019` `6_2019` `8_2019` `9_2020` `10_2019` `11_2020` `9_2019`
  <chr>         <chr>         <chr>     <int>    <int>    <int>     <int>    <int>    <int>    <int>    <int>    <int>    <int>    <int>     <int>     <int>    <int>
1 pro           A              SALES     23       22        6         3       15        8        7        2        5        8       13        12         7        6
2 pro           B              WERNE     16       16        7        12        4       11        4       12       13        9        8         9         7       11
3 pro           C              EQT C      5       16        1         2        3        0        6        2        5        1        2         5         3        5
4 pro           D              T-MOB      7        7       16         2        8        6        4       10        4        2        6         3         1        2
5 leader        A              INSPE      2        1        1        15        0        9        1        7        8        4        2         8         2        6
6 pro           F              AXON       4        4        1         1        0       14        0       12        1        0        0         1         1        0
# ... with 10 more variables: `1_2020` <int>, `11_2019` <int>, `10_2020` <int>, `2_2020` <int>, `2_2019` <int>, `3_2019` <int>, `12_2020` <int>, `7_2019` <int>,
#   `6_2020` <int>, `5_2020` <int>

这个数据框（数据框 2）：

# A tibble: 6 x 25
# Groups:   Full.Name [6]
  Full.Name  `1_2019` `1_2020` `2_2019` `2_2020` `3_2019` `3_2020` `4_2019` `4_2020` `5_2019` `5_2020` `6_2019` `6_2020` `7_2019` `7_2020` `8_2019` `8_2020` `9_2019`
  <chr>         <int>    <int>    <int>    <int>    <int>    <int>    <int>    <int>    <int>    <int>    <int>    <int>    <int>    <int>    <int>    <int>    <int>
1 A               556      220      577      213      628      102      608       58      547      321      429      233      371      370      338      334      409
2 B                 0       48        0        0        0        0       14       45        0        0       83        0        0       16        0        0        0
3 C                 0        0        0        0        0        0        0        0        0        0        0        0        0        0        0        0        0
4 D                 0        0        0        0        0      110        0       63       60      125       23        0       46       32        0       47       18
5 E                73      163       68      113      122        0       70       71      135      124       62      101       28       21       98       11      109
6 F                96      135       57       32      149      173      211      145       61       46      190      308       32      159      210      168        0
# ... with 7 more variables: `9_2020` <int>, `10_2019` <int>, `10_2020` <int>, `11_2019` <int>, `11_2020` <int>, `12_2019` <int>, `12_2020` <int>

现在我想创建另一个数据框（数据框 3），它应该包含“全名”列的所有名称，并将数据框 1“x_20xx”的所有列值除以数据框的匹配值2. 这有点棘手，因为不幸的是，两个数据框都有完全相同的列名“x_20xx”。也许我们需要先更改它以计算新数据框的值。

这是我想要的示例：显然包含所有“x_20xx”列。

  Full.Name                          `1_2019_freq` 
  <chr>                                  <int>    
1 A                      (value_1_2019 df1) / (value_1_2019 df2)     
2 B                      (value_1_2019 df1) / (value_1_2019 df2)
3 C                      (value_1_2019 df1) / (value_1_2019 df2)
4 D                      (value_1_2019 df1) / (value_1_2019 df2)
5 E                      (value_1_2019 df1) / (value_1_2019 df2)
6 F                      (value_1_2019 df1) / (value_1_2019 df2)

这很难解释，但我希望你明白我的意思。这是一个简单的计算，但我无法将所有“A”和“B”[...] 分组并对它们的列求和，以便只有一行包含“A”和所有其他月份，但每个月份的总和值月。之后，我们只需要创建另一个新的dataframe并将值除以df2的匹配值。任何帮助表示赞赏。非常感谢您考虑我的请求。我的 dput() 输出：数据框 1：

structure(list(authority_dic = c("pro", "pro", "pro", "pro", 
"leader", "pro", "know", "pro", "pro", "pro"), Full.Name = c("Marc R. Benioff", 
"Derek J. Leathers", "Robert J. McNally", "G. Michael Sievert", 
"Paul J. Sarvadi", "Patrick W. Smith", "John J. Legere", "John J. Legere", 
"Patrick K. Decker", "Michael J. Saylor"), Entity = c("SALESFORCE.COM INC", 
"WERNER ENTERPRISES INC", "EQT CORP (Rapidan Energy)", "T-MOBILE US INC", 
"INSPERITY INC", "AXON ENTERPRISE INC", "T-MOBILE US INC", "T-MOBILE US INC", 
"XYLEM INC", "MICROSTRATEGY INC"), `4_2020` = c(23L, 16L, 5L, 
7L, 2L, 4L, 9L, 4L, 2L, 2L), `3_2020` = c(22L, 16L, 16L, 7L, 
1L, 4L, 8L, 2L, 6L, 0L), `7_2020` = c(6L, 7L, 1L, 16L, 1L, 1L, 
3L, 1L, 3L, 0L), `12_2019` = c(3L, 12L, 2L, 2L, 15L, 1L, 4L, 
3L, 3L, 0L), `8_2020` = c(15L, 4L, 3L, 8L, 0L, 0L, 4L, 2L, 2L, 
0L), `5_2019` = c(8L, 11L, 0L, 6L, 9L, 14L, 7L, 6L, 9L, 0L), 
    `1_2019` = c(7L, 4L, 6L, 4L, 1L, 0L, 13L, 4L, 0L, 2L), `4_2019` = c(2L, 
    12L, 2L, 10L, 7L, 12L, 2L, 13L, 2L, 0L), `6_2019` = c(5L, 
    13L, 5L, 4L, 8L, 1L, 6L, 4L, 6L, 0L), `8_2019` = c(8L, 9L, 
    1L, 2L, 4L, 0L, 9L, 8L, 13L, 0L), `9_2020` = c(13L, 8L, 2L, 
    6L, 2L, 0L, 5L, 2L, 0L, 3L), `10_2019` = c(12L, 9L, 5L, 3L, 
    8L, 1L, 10L, 7L, 9L, 0L), `11_2020` = c(7L, 7L, 3L, 1L, 2L, 
    1L, 4L, 0L, 3L, 12L), `9_2019` = c(6L, 11L, 5L, 2L, 6L, 0L, 
    11L, 6L, 6L, 0L), `1_2020` = c(6L, 9L, 1L, 4L, 11L, 5L, 4L, 
    5L, 3L, 0L), `11_2019` = c(10L, 11L, 0L, 5L, 11L, 1L, 6L, 
    4L, 4L, 0L), `10_2020` = c(10L, 7L, 3L, 5L, 3L, 0L, 5L, 1L, 
    1L, 7L), `2_2020` = c(7L, 4L, 3L, 4L, 4L, 0L, 10L, 7L, 5L, 
    0L), `2_2019` = c(7L, 5L, 5L, 1L, 6L, 0L, 6L, 3L, 0L, 0L), 
    `3_2019` = c(4L, 9L, 1L, 6L, 6L, 3L, 6L, 10L, 4L, 0L), `12_2020` = c(3L, 
    10L, 3L, 5L, 0L, 2L, 2L, 1L, 3L, 10L), `7_2019` = c(7L, 9L, 
    4L, 2L, 6L, 0L, 3L, 8L, 1L, 0L), `6_2020` = c(3L, 3L, 1L, 
    3L, 1L, 2L, 2L, 2L, 8L, 0L), `5_2020` = c(2L, 7L, 0L, 7L, 
    4L, 2L, 3L, 4L, 1L, 0L)), class = c("grouped_df", "tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -10L), groups = structure(list(
    authority_dic = c("know", "leader", "pro", "pro", "pro", 
    "pro", "pro", "pro", "pro", "pro"), Full.Name = c("John J. Legere", 
    "Paul J. Sarvadi", "Derek J. Leathers", "G. Michael Sievert", 
    "John J. Legere", "Marc R. Benioff", "Michael J. Saylor", 
    "Patrick K. Decker", "Patrick W. Smith", "Robert J. McNally"
    ), .rows = structure(list(7L, 5L, 2L, 4L, 8L, 1L, 10L, 9L, 
        6L, 3L), ptype = integer(0), class = c("vctrs_list_of", 
    "vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -10L), .drop = TRUE))

数据框 2：

structure(list(Full.Name = c("A. Patrick Beharelle", "Aaron P. Graft", 
"Aaron P. Jagdfeld", "Adam H. Schechter", "Adam P. Symson", "Adena T. Friedman", 
"John J. Legere", "Albert Bourla, D.V.M., DVM, Ph.D.", 
"Andres Ricardo Gluski Weilert", "Andrew Anagnost"), `1_2019` = c(556L, 
0L, 0L, 0L, 73L, 96L, 0L, 0L, 40L, 222L), `1_2020` = c(220L, 
48L, 0L, 0L, 163L, 135L, 0L, 209L, 0L, 110L), `2_2019` = c(577L, 
0L, 0L, 0L, 68L, 57L, 0L, 0L, 98L, 215L), `2_2020` = c(213L, 
0L, 0L, 0L, 113L, 32L, 0L, 66L, 0L, 133L), `3_2019` = c(628L, 
0L, 0L, 0L, 122L, 149L, 0L, 0L, 8L, 278L), `3_2020` = c(102L, 
0L, 0L, 110L, 0L, 173L, 0L, 243L, 0L, 120L), `4_2019` = c(608L, 
14L, 0L, 0L, 70L, 211L, 0L, 0L, 18L, 200L), `4_2020` = c(58L, 
45L, 0L, 63L, 71L, 145L, 0L, 315L, 0L, 111L), `5_2019` = c(547L, 
0L, 0L, 60L, 135L, 61L, 0L, 0L, 68L, 101L), `5_2020` = c(321L, 
0L, 0L, 125L, 124L, 46L, 0L, 100L, 0L, 166L), `6_2019` = c(429L, 
83L, 0L, 23L, 62L, 190L, 0L, 0L, 150L, 242L), `6_2020` = c(233L, 
0L, 0L, 0L, 101L, 308L, 112L, 128L, 0L, 201L), `7_2019` = c(371L, 
0L, 0L, 46L, 28L, 32L, 0L, 111L, 0L, 333L), `7_2020` = c(370L, 
16L, 0L, 32L, 21L, 159L, 0L, 199L, 43L, 15L), `8_2019` = c(338L, 
0L, 0L, 0L, 98L, 210L, 0L, 56L, 0L, 212L), `8_2020` = c(334L, 
0L, 0L, 47L, 11L, 168L, 0L, 207L, 0L, 40L), `9_2019` = c(409L, 
0L, 0L, 18L, 109L, 0L, 0L, 73L, 0L, 387L), `9_2020` = c(406L, 
91L, 5L, 0L, 40L, 164L, 23L, 183L, 65L, 28L), `10_2019` = c(330L, 
35L, 0L, 0L, 63L, 119L, 0L, 192L, 34L, 88L), `10_2020` = c(331L, 
0L, 0L, 0L, 25L, 185L, 71L, 451L, 43L, 29L), `11_2019` = c(448L, 
33L, 0L, 0L, 131L, 108L, 0L, 41L, 35L, 196L), `11_2020` = c(559L, 
0L, 0L, 44L, 83L, 86L, 49L, 803L, 0L, 132L), `12_2019` = c(300L, 
0L, 4L, 0L, 72L, 167L, 0L, 155L, 68L, 93L), `12_2020` = c(122L, 
0L, 0L, 0L, 0L, 54L, 0L, 790L, 0L, 1L)), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -10L), groups = structure(list(
    Full.Name = c("A. Patrick Beharelle", "Aaron P. Graft", "Aaron P. Jagdfeld", 
    "Adam H. Schechter", "Adam P. Symson", "Adena T. Friedman", 
    "Alan David Schnitzer", "Albert Bourla, D.V.M., DVM, Ph.D.", 
    "Andres Ricardo Gluski Weilert", "Andrew Anagnost"), .rows = structure(list(
        1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L), ptype = integer(0), class = c("vctrs_list_of", 
    "vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -10L), .drop = TRUE))

以“John J. Legere”为例

Answer 1

我会为此使用基础 R。此方法仅在结果中包含在两个数据框中都匹配的名称。示例数据中，John Legere是唯一匹配的名字，df2中对应的值全部为0，R除以0时输出Inf。

如果有问题需要更多帮助，我建议更新示例数据，以便有更多匹配项，并且有一些非 0 值可以除以。

cols_to_divide = setdiff(intersect(names(df1), names(df2)), "Full.Name")
matching_names = intersect(df1$Full.Name, df2$Full.Name)
# initialize the results with the numerator
df3 = df1[df1$Full.Name %in% matching_names, c("Full.Name", "authority_dic", cols_to_divide)]
# find the right row numbers for the denominator
rows_to_divide = match(df3$Full.Name, df2$Full.Name)
# divide
df3[cols_to_divide] = df3[cols_to_divide] / df2[rows_to_divide, cols_to_divide]

df3
# # A tibble: 2 × 26
# # Groups:   authority_dic, Full.Name [2]
#   Full.Name      authority_dic `4_2020` `3_2020` `7_2020` `12_2019` `8_2020` `5_2019` `1_2019`
#   <chr>          <chr>            <dbl>    <dbl>    <dbl>     <dbl>    <dbl>    <dbl>    <dbl>
# 1 John J. Legere know               Inf      Inf      Inf       Inf      Inf      Inf      Inf
# 2 John J. Legere pro                Inf      Inf      Inf       Inf      Inf      Inf      Inf
# # … with 17 more variables: 4_2019 <dbl>, 6_2019 <dbl>, 8_2019 <dbl>, 9_2020 <dbl>,
# #   10_2019 <dbl>, 11_2020 <dbl>, 9_2019 <dbl>, 1_2020 <dbl>, 11_2019 <dbl>, 10_2020 <dbl>,
# #   2_2020 <dbl>, 2_2019 <dbl>, 3_2019 <dbl>, 12_2020 <dbl>, 7_2019 <dbl>, 6_2020 <dbl>,
# #   5_2020 <dbl>

Answer 2

像往常一样，处理宽数据很难，但处理长数据很容易：

inner_join(
  pivot_longer(df1, cols=4:27, names_to="period") %>% group_by(Full.Name, period) %>% summarize(value=sum(value,na.rm=T), .groups="drop"),
  pivot_longer(df2, cols=2:25, names_to="period") %>% group_by(Full.Name, period) %>% summarize(value=sum(value,na.rm=T), .groups="drop"), 
  by=c("Full.Name", "period")
) %>% 
  mutate(result = value.x/value.y) %>% 
  pivot_wider(id_cols = Full.Name, names_from="period", values_from="result")

如果您想保留 df1 中未出现在 df2

中的名称，请更改为 left_join()

输出：

# A tibble: 1 × 25
  Full.Name      `1_2019` `1_2020` `10_2019` `10_2020` `11_2019` `11_2020` `12_2019` `12_2020` `2_2019` `2_2020` `3_2019` `3_2020` `4_2019` `4_2020` `5_2019` `5_2020`
  <chr>             <dbl>    <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
1 John J. Legere      Inf      Inf       Inf    0.0845       Inf    0.0816       Inf       Inf      Inf      Inf      Inf      Inf      Inf      Inf      Inf      Inf
# … with 8 more variables: `6_2019` <dbl>, `6_2020` <dbl>, `7_2019` <dbl>, `7_2020` <dbl>, `8_2019` <dbl>, `8_2020` <dbl>, `9_2019` <dbl>, `9_2020` <dbl>

请注意，如果该帧中每个 Full.Name 只有一行，则旋转 df2 后的 group_by/summarize 可以被删除。另外，请注意使用 na.rm=T 求和，这可能是您想要的，但也会在 value.y 中产生零，导致很多除以零。

Dataframe结构问题及新Dataframe的计算

Dataframe Structure Problems and Calculation of new Dataframe

r

dplyr

tidyr

tidyverse