Dataframe结构问题及新Dataframe的计算
Dataframe Structure Problems and Calculation of new Dataframe
所以我得到了这个数据框(数据框 1):
# A tibble: 6 x 27
# Groups: authority_dic, Full.Name [6]
authority_dic Full.Name Entity `4_2020` `3_2020` `7_2020` `12_2019` `8_2020` `5_2019` `1_2019` `4_2019` `6_2019` `8_2019` `9_2020` `10_2019` `11_2020` `9_2019`
<chr> <chr> <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 pro A SALES 23 22 6 3 15 8 7 2 5 8 13 12 7 6
2 pro B WERNE 16 16 7 12 4 11 4 12 13 9 8 9 7 11
3 pro C EQT C 5 16 1 2 3 0 6 2 5 1 2 5 3 5
4 pro D T-MOB 7 7 16 2 8 6 4 10 4 2 6 3 1 2
5 leader A INSPE 2 1 1 15 0 9 1 7 8 4 2 8 2 6
6 pro F AXON 4 4 1 1 0 14 0 12 1 0 0 1 1 0
# ... with 10 more variables: `1_2020` <int>, `11_2019` <int>, `10_2020` <int>, `2_2020` <int>, `2_2019` <int>, `3_2019` <int>, `12_2020` <int>, `7_2019` <int>,
# `6_2020` <int>, `5_2020` <int>
这个数据框(数据框 2):
# A tibble: 6 x 25
# Groups: Full.Name [6]
Full.Name `1_2019` `1_2020` `2_2019` `2_2020` `3_2019` `3_2020` `4_2019` `4_2020` `5_2019` `5_2020` `6_2019` `6_2020` `7_2019` `7_2020` `8_2019` `8_2020` `9_2019`
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 A 556 220 577 213 628 102 608 58 547 321 429 233 371 370 338 334 409
2 B 0 48 0 0 0 0 14 45 0 0 83 0 0 16 0 0 0
3 C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 D 0 0 0 0 0 110 0 63 60 125 23 0 46 32 0 47 18
5 E 73 163 68 113 122 0 70 71 135 124 62 101 28 21 98 11 109
6 F 96 135 57 32 149 173 211 145 61 46 190 308 32 159 210 168 0
# ... with 7 more variables: `9_2020` <int>, `10_2019` <int>, `10_2020` <int>, `11_2019` <int>, `11_2020` <int>, `12_2019` <int>, `12_2020` <int>
现在我想创建另一个数据框(数据框 3),它应该包含“全名”列的所有名称,并将数据框 1“x_20xx”的所有列值除以数据框的匹配值2. 这有点棘手,因为不幸的是,两个数据框都有完全相同的列名“x_20xx”。也许我们需要先更改它以计算新数据框的值。
这是我想要的示例:显然包含所有“x_20xx”列。
Full.Name `1_2019_freq`
<chr> <int>
1 A (value_1_2019 df1) / (value_1_2019 df2)
2 B (value_1_2019 df1) / (value_1_2019 df2)
3 C (value_1_2019 df1) / (value_1_2019 df2)
4 D (value_1_2019 df1) / (value_1_2019 df2)
5 E (value_1_2019 df1) / (value_1_2019 df2)
6 F (value_1_2019 df1) / (value_1_2019 df2)
这很难解释,但我希望你明白我的意思。这是一个简单的计算,但我无法将所有“A”和“B”[...] 分组并对它们的列求和,以便只有一行包含“A”和所有其他月份,但每个月份的总和值月。之后,我们只需要创建另一个新的dataframe并将值除以df2的匹配值。
任何帮助表示赞赏。
非常感谢您考虑我的请求。
我的 dput()
输出:
数据框 1:
structure(list(authority_dic = c("pro", "pro", "pro", "pro",
"leader", "pro", "know", "pro", "pro", "pro"), Full.Name = c("Marc R. Benioff",
"Derek J. Leathers", "Robert J. McNally", "G. Michael Sievert",
"Paul J. Sarvadi", "Patrick W. Smith", "John J. Legere", "John J. Legere",
"Patrick K. Decker", "Michael J. Saylor"), Entity = c("SALESFORCE.COM INC",
"WERNER ENTERPRISES INC", "EQT CORP (Rapidan Energy)", "T-MOBILE US INC",
"INSPERITY INC", "AXON ENTERPRISE INC", "T-MOBILE US INC", "T-MOBILE US INC",
"XYLEM INC", "MICROSTRATEGY INC"), `4_2020` = c(23L, 16L, 5L,
7L, 2L, 4L, 9L, 4L, 2L, 2L), `3_2020` = c(22L, 16L, 16L, 7L,
1L, 4L, 8L, 2L, 6L, 0L), `7_2020` = c(6L, 7L, 1L, 16L, 1L, 1L,
3L, 1L, 3L, 0L), `12_2019` = c(3L, 12L, 2L, 2L, 15L, 1L, 4L,
3L, 3L, 0L), `8_2020` = c(15L, 4L, 3L, 8L, 0L, 0L, 4L, 2L, 2L,
0L), `5_2019` = c(8L, 11L, 0L, 6L, 9L, 14L, 7L, 6L, 9L, 0L),
`1_2019` = c(7L, 4L, 6L, 4L, 1L, 0L, 13L, 4L, 0L, 2L), `4_2019` = c(2L,
12L, 2L, 10L, 7L, 12L, 2L, 13L, 2L, 0L), `6_2019` = c(5L,
13L, 5L, 4L, 8L, 1L, 6L, 4L, 6L, 0L), `8_2019` = c(8L, 9L,
1L, 2L, 4L, 0L, 9L, 8L, 13L, 0L), `9_2020` = c(13L, 8L, 2L,
6L, 2L, 0L, 5L, 2L, 0L, 3L), `10_2019` = c(12L, 9L, 5L, 3L,
8L, 1L, 10L, 7L, 9L, 0L), `11_2020` = c(7L, 7L, 3L, 1L, 2L,
1L, 4L, 0L, 3L, 12L), `9_2019` = c(6L, 11L, 5L, 2L, 6L, 0L,
11L, 6L, 6L, 0L), `1_2020` = c(6L, 9L, 1L, 4L, 11L, 5L, 4L,
5L, 3L, 0L), `11_2019` = c(10L, 11L, 0L, 5L, 11L, 1L, 6L,
4L, 4L, 0L), `10_2020` = c(10L, 7L, 3L, 5L, 3L, 0L, 5L, 1L,
1L, 7L), `2_2020` = c(7L, 4L, 3L, 4L, 4L, 0L, 10L, 7L, 5L,
0L), `2_2019` = c(7L, 5L, 5L, 1L, 6L, 0L, 6L, 3L, 0L, 0L),
`3_2019` = c(4L, 9L, 1L, 6L, 6L, 3L, 6L, 10L, 4L, 0L), `12_2020` = c(3L,
10L, 3L, 5L, 0L, 2L, 2L, 1L, 3L, 10L), `7_2019` = c(7L, 9L,
4L, 2L, 6L, 0L, 3L, 8L, 1L, 0L), `6_2020` = c(3L, 3L, 1L,
3L, 1L, 2L, 2L, 2L, 8L, 0L), `5_2020` = c(2L, 7L, 0L, 7L,
4L, 2L, 3L, 4L, 1L, 0L)), class = c("grouped_df", "tbl_df",
"tbl", "data.frame"), row.names = c(NA, -10L), groups = structure(list(
authority_dic = c("know", "leader", "pro", "pro", "pro",
"pro", "pro", "pro", "pro", "pro"), Full.Name = c("John J. Legere",
"Paul J. Sarvadi", "Derek J. Leathers", "G. Michael Sievert",
"John J. Legere", "Marc R. Benioff", "Michael J. Saylor",
"Patrick K. Decker", "Patrick W. Smith", "Robert J. McNally"
), .rows = structure(list(7L, 5L, 2L, 4L, 8L, 1L, 10L, 9L,
6L, 3L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -10L), .drop = TRUE))
数据框 2:
structure(list(Full.Name = c("A. Patrick Beharelle", "Aaron P. Graft",
"Aaron P. Jagdfeld", "Adam H. Schechter", "Adam P. Symson", "Adena T. Friedman",
"John J. Legere", "Albert Bourla, D.V.M., DVM, Ph.D.",
"Andres Ricardo Gluski Weilert", "Andrew Anagnost"), `1_2019` = c(556L,
0L, 0L, 0L, 73L, 96L, 0L, 0L, 40L, 222L), `1_2020` = c(220L,
48L, 0L, 0L, 163L, 135L, 0L, 209L, 0L, 110L), `2_2019` = c(577L,
0L, 0L, 0L, 68L, 57L, 0L, 0L, 98L, 215L), `2_2020` = c(213L,
0L, 0L, 0L, 113L, 32L, 0L, 66L, 0L, 133L), `3_2019` = c(628L,
0L, 0L, 0L, 122L, 149L, 0L, 0L, 8L, 278L), `3_2020` = c(102L,
0L, 0L, 110L, 0L, 173L, 0L, 243L, 0L, 120L), `4_2019` = c(608L,
14L, 0L, 0L, 70L, 211L, 0L, 0L, 18L, 200L), `4_2020` = c(58L,
45L, 0L, 63L, 71L, 145L, 0L, 315L, 0L, 111L), `5_2019` = c(547L,
0L, 0L, 60L, 135L, 61L, 0L, 0L, 68L, 101L), `5_2020` = c(321L,
0L, 0L, 125L, 124L, 46L, 0L, 100L, 0L, 166L), `6_2019` = c(429L,
83L, 0L, 23L, 62L, 190L, 0L, 0L, 150L, 242L), `6_2020` = c(233L,
0L, 0L, 0L, 101L, 308L, 112L, 128L, 0L, 201L), `7_2019` = c(371L,
0L, 0L, 46L, 28L, 32L, 0L, 111L, 0L, 333L), `7_2020` = c(370L,
16L, 0L, 32L, 21L, 159L, 0L, 199L, 43L, 15L), `8_2019` = c(338L,
0L, 0L, 0L, 98L, 210L, 0L, 56L, 0L, 212L), `8_2020` = c(334L,
0L, 0L, 47L, 11L, 168L, 0L, 207L, 0L, 40L), `9_2019` = c(409L,
0L, 0L, 18L, 109L, 0L, 0L, 73L, 0L, 387L), `9_2020` = c(406L,
91L, 5L, 0L, 40L, 164L, 23L, 183L, 65L, 28L), `10_2019` = c(330L,
35L, 0L, 0L, 63L, 119L, 0L, 192L, 34L, 88L), `10_2020` = c(331L,
0L, 0L, 0L, 25L, 185L, 71L, 451L, 43L, 29L), `11_2019` = c(448L,
33L, 0L, 0L, 131L, 108L, 0L, 41L, 35L, 196L), `11_2020` = c(559L,
0L, 0L, 44L, 83L, 86L, 49L, 803L, 0L, 132L), `12_2019` = c(300L,
0L, 4L, 0L, 72L, 167L, 0L, 155L, 68L, 93L), `12_2020` = c(122L,
0L, 0L, 0L, 0L, 54L, 0L, 790L, 0L, 1L)), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -10L), groups = structure(list(
Full.Name = c("A. Patrick Beharelle", "Aaron P. Graft", "Aaron P. Jagdfeld",
"Adam H. Schechter", "Adam P. Symson", "Adena T. Friedman",
"Alan David Schnitzer", "Albert Bourla, D.V.M., DVM, Ph.D.",
"Andres Ricardo Gluski Weilert", "Andrew Anagnost"), .rows = structure(list(
1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -10L), .drop = TRUE))
以“John J. Legere”为例
我会为此使用基础 R。此方法仅在结果中包含在两个数据框中都匹配的名称。示例数据中,John Legere是唯一匹配的名字,df2
中对应的值全部为0,R除以0时输出Inf
。
如果有问题需要更多帮助,我建议更新示例数据,以便有更多匹配项,并且有一些非 0 值可以除以。
cols_to_divide = setdiff(intersect(names(df1), names(df2)), "Full.Name")
matching_names = intersect(df1$Full.Name, df2$Full.Name)
# initialize the results with the numerator
df3 = df1[df1$Full.Name %in% matching_names, c("Full.Name", "authority_dic", cols_to_divide)]
# find the right row numbers for the denominator
rows_to_divide = match(df3$Full.Name, df2$Full.Name)
# divide
df3[cols_to_divide] = df3[cols_to_divide] / df2[rows_to_divide, cols_to_divide]
df3
# # A tibble: 2 × 26
# # Groups: authority_dic, Full.Name [2]
# Full.Name authority_dic `4_2020` `3_2020` `7_2020` `12_2019` `8_2020` `5_2019` `1_2019`
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 John J. Legere know Inf Inf Inf Inf Inf Inf Inf
# 2 John J. Legere pro Inf Inf Inf Inf Inf Inf Inf
# # … with 17 more variables: 4_2019 <dbl>, 6_2019 <dbl>, 8_2019 <dbl>, 9_2020 <dbl>,
# # 10_2019 <dbl>, 11_2020 <dbl>, 9_2019 <dbl>, 1_2020 <dbl>, 11_2019 <dbl>, 10_2020 <dbl>,
# # 2_2020 <dbl>, 2_2019 <dbl>, 3_2019 <dbl>, 12_2020 <dbl>, 7_2019 <dbl>, 6_2020 <dbl>,
# # 5_2020 <dbl>
像往常一样,处理宽数据很难,但处理长数据很容易:
inner_join(
pivot_longer(df1, cols=4:27, names_to="period") %>% group_by(Full.Name, period) %>% summarize(value=sum(value,na.rm=T), .groups="drop"),
pivot_longer(df2, cols=2:25, names_to="period") %>% group_by(Full.Name, period) %>% summarize(value=sum(value,na.rm=T), .groups="drop"),
by=c("Full.Name", "period")
) %>%
mutate(result = value.x/value.y) %>%
pivot_wider(id_cols = Full.Name, names_from="period", values_from="result")
如果您想保留 df1
中未出现在 df2
中的名称,请更改为 left_join()
输出:
# A tibble: 1 × 25
Full.Name `1_2019` `1_2020` `10_2019` `10_2020` `11_2019` `11_2020` `12_2019` `12_2020` `2_2019` `2_2020` `3_2019` `3_2020` `4_2019` `4_2020` `5_2019` `5_2020`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 John J. Legere Inf Inf Inf 0.0845 Inf 0.0816 Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf
# … with 8 more variables: `6_2019` <dbl>, `6_2020` <dbl>, `7_2019` <dbl>, `7_2020` <dbl>, `8_2019` <dbl>, `8_2020` <dbl>, `9_2019` <dbl>, `9_2020` <dbl>
请注意,如果该帧中每个 Full.Name 只有一行,则旋转 df2 后的 group_by/summarize 可以被删除。另外,请注意使用 na.rm=T 求和,这可能是您想要的,但也会在 value.y 中产生零,导致很多除以零。
所以我得到了这个数据框(数据框 1):
# A tibble: 6 x 27
# Groups: authority_dic, Full.Name [6]
authority_dic Full.Name Entity `4_2020` `3_2020` `7_2020` `12_2019` `8_2020` `5_2019` `1_2019` `4_2019` `6_2019` `8_2019` `9_2020` `10_2019` `11_2020` `9_2019`
<chr> <chr> <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 pro A SALES 23 22 6 3 15 8 7 2 5 8 13 12 7 6
2 pro B WERNE 16 16 7 12 4 11 4 12 13 9 8 9 7 11
3 pro C EQT C 5 16 1 2 3 0 6 2 5 1 2 5 3 5
4 pro D T-MOB 7 7 16 2 8 6 4 10 4 2 6 3 1 2
5 leader A INSPE 2 1 1 15 0 9 1 7 8 4 2 8 2 6
6 pro F AXON 4 4 1 1 0 14 0 12 1 0 0 1 1 0
# ... with 10 more variables: `1_2020` <int>, `11_2019` <int>, `10_2020` <int>, `2_2020` <int>, `2_2019` <int>, `3_2019` <int>, `12_2020` <int>, `7_2019` <int>,
# `6_2020` <int>, `5_2020` <int>
这个数据框(数据框 2):
# A tibble: 6 x 25
# Groups: Full.Name [6]
Full.Name `1_2019` `1_2020` `2_2019` `2_2020` `3_2019` `3_2020` `4_2019` `4_2020` `5_2019` `5_2020` `6_2019` `6_2020` `7_2019` `7_2020` `8_2019` `8_2020` `9_2019`
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 A 556 220 577 213 628 102 608 58 547 321 429 233 371 370 338 334 409
2 B 0 48 0 0 0 0 14 45 0 0 83 0 0 16 0 0 0
3 C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 D 0 0 0 0 0 110 0 63 60 125 23 0 46 32 0 47 18
5 E 73 163 68 113 122 0 70 71 135 124 62 101 28 21 98 11 109
6 F 96 135 57 32 149 173 211 145 61 46 190 308 32 159 210 168 0
# ... with 7 more variables: `9_2020` <int>, `10_2019` <int>, `10_2020` <int>, `11_2019` <int>, `11_2020` <int>, `12_2019` <int>, `12_2020` <int>
现在我想创建另一个数据框(数据框 3),它应该包含“全名”列的所有名称,并将数据框 1“x_20xx”的所有列值除以数据框的匹配值2. 这有点棘手,因为不幸的是,两个数据框都有完全相同的列名“x_20xx”。也许我们需要先更改它以计算新数据框的值。
这是我想要的示例:显然包含所有“x_20xx”列。
Full.Name `1_2019_freq`
<chr> <int>
1 A (value_1_2019 df1) / (value_1_2019 df2)
2 B (value_1_2019 df1) / (value_1_2019 df2)
3 C (value_1_2019 df1) / (value_1_2019 df2)
4 D (value_1_2019 df1) / (value_1_2019 df2)
5 E (value_1_2019 df1) / (value_1_2019 df2)
6 F (value_1_2019 df1) / (value_1_2019 df2)
这很难解释,但我希望你明白我的意思。这是一个简单的计算,但我无法将所有“A”和“B”[...] 分组并对它们的列求和,以便只有一行包含“A”和所有其他月份,但每个月份的总和值月。之后,我们只需要创建另一个新的dataframe并将值除以df2的匹配值。
任何帮助表示赞赏。
非常感谢您考虑我的请求。
我的 dput()
输出:
数据框 1:
structure(list(authority_dic = c("pro", "pro", "pro", "pro",
"leader", "pro", "know", "pro", "pro", "pro"), Full.Name = c("Marc R. Benioff",
"Derek J. Leathers", "Robert J. McNally", "G. Michael Sievert",
"Paul J. Sarvadi", "Patrick W. Smith", "John J. Legere", "John J. Legere",
"Patrick K. Decker", "Michael J. Saylor"), Entity = c("SALESFORCE.COM INC",
"WERNER ENTERPRISES INC", "EQT CORP (Rapidan Energy)", "T-MOBILE US INC",
"INSPERITY INC", "AXON ENTERPRISE INC", "T-MOBILE US INC", "T-MOBILE US INC",
"XYLEM INC", "MICROSTRATEGY INC"), `4_2020` = c(23L, 16L, 5L,
7L, 2L, 4L, 9L, 4L, 2L, 2L), `3_2020` = c(22L, 16L, 16L, 7L,
1L, 4L, 8L, 2L, 6L, 0L), `7_2020` = c(6L, 7L, 1L, 16L, 1L, 1L,
3L, 1L, 3L, 0L), `12_2019` = c(3L, 12L, 2L, 2L, 15L, 1L, 4L,
3L, 3L, 0L), `8_2020` = c(15L, 4L, 3L, 8L, 0L, 0L, 4L, 2L, 2L,
0L), `5_2019` = c(8L, 11L, 0L, 6L, 9L, 14L, 7L, 6L, 9L, 0L),
`1_2019` = c(7L, 4L, 6L, 4L, 1L, 0L, 13L, 4L, 0L, 2L), `4_2019` = c(2L,
12L, 2L, 10L, 7L, 12L, 2L, 13L, 2L, 0L), `6_2019` = c(5L,
13L, 5L, 4L, 8L, 1L, 6L, 4L, 6L, 0L), `8_2019` = c(8L, 9L,
1L, 2L, 4L, 0L, 9L, 8L, 13L, 0L), `9_2020` = c(13L, 8L, 2L,
6L, 2L, 0L, 5L, 2L, 0L, 3L), `10_2019` = c(12L, 9L, 5L, 3L,
8L, 1L, 10L, 7L, 9L, 0L), `11_2020` = c(7L, 7L, 3L, 1L, 2L,
1L, 4L, 0L, 3L, 12L), `9_2019` = c(6L, 11L, 5L, 2L, 6L, 0L,
11L, 6L, 6L, 0L), `1_2020` = c(6L, 9L, 1L, 4L, 11L, 5L, 4L,
5L, 3L, 0L), `11_2019` = c(10L, 11L, 0L, 5L, 11L, 1L, 6L,
4L, 4L, 0L), `10_2020` = c(10L, 7L, 3L, 5L, 3L, 0L, 5L, 1L,
1L, 7L), `2_2020` = c(7L, 4L, 3L, 4L, 4L, 0L, 10L, 7L, 5L,
0L), `2_2019` = c(7L, 5L, 5L, 1L, 6L, 0L, 6L, 3L, 0L, 0L),
`3_2019` = c(4L, 9L, 1L, 6L, 6L, 3L, 6L, 10L, 4L, 0L), `12_2020` = c(3L,
10L, 3L, 5L, 0L, 2L, 2L, 1L, 3L, 10L), `7_2019` = c(7L, 9L,
4L, 2L, 6L, 0L, 3L, 8L, 1L, 0L), `6_2020` = c(3L, 3L, 1L,
3L, 1L, 2L, 2L, 2L, 8L, 0L), `5_2020` = c(2L, 7L, 0L, 7L,
4L, 2L, 3L, 4L, 1L, 0L)), class = c("grouped_df", "tbl_df",
"tbl", "data.frame"), row.names = c(NA, -10L), groups = structure(list(
authority_dic = c("know", "leader", "pro", "pro", "pro",
"pro", "pro", "pro", "pro", "pro"), Full.Name = c("John J. Legere",
"Paul J. Sarvadi", "Derek J. Leathers", "G. Michael Sievert",
"John J. Legere", "Marc R. Benioff", "Michael J. Saylor",
"Patrick K. Decker", "Patrick W. Smith", "Robert J. McNally"
), .rows = structure(list(7L, 5L, 2L, 4L, 8L, 1L, 10L, 9L,
6L, 3L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -10L), .drop = TRUE))
数据框 2:
structure(list(Full.Name = c("A. Patrick Beharelle", "Aaron P. Graft",
"Aaron P. Jagdfeld", "Adam H. Schechter", "Adam P. Symson", "Adena T. Friedman",
"John J. Legere", "Albert Bourla, D.V.M., DVM, Ph.D.",
"Andres Ricardo Gluski Weilert", "Andrew Anagnost"), `1_2019` = c(556L,
0L, 0L, 0L, 73L, 96L, 0L, 0L, 40L, 222L), `1_2020` = c(220L,
48L, 0L, 0L, 163L, 135L, 0L, 209L, 0L, 110L), `2_2019` = c(577L,
0L, 0L, 0L, 68L, 57L, 0L, 0L, 98L, 215L), `2_2020` = c(213L,
0L, 0L, 0L, 113L, 32L, 0L, 66L, 0L, 133L), `3_2019` = c(628L,
0L, 0L, 0L, 122L, 149L, 0L, 0L, 8L, 278L), `3_2020` = c(102L,
0L, 0L, 110L, 0L, 173L, 0L, 243L, 0L, 120L), `4_2019` = c(608L,
14L, 0L, 0L, 70L, 211L, 0L, 0L, 18L, 200L), `4_2020` = c(58L,
45L, 0L, 63L, 71L, 145L, 0L, 315L, 0L, 111L), `5_2019` = c(547L,
0L, 0L, 60L, 135L, 61L, 0L, 0L, 68L, 101L), `5_2020` = c(321L,
0L, 0L, 125L, 124L, 46L, 0L, 100L, 0L, 166L), `6_2019` = c(429L,
83L, 0L, 23L, 62L, 190L, 0L, 0L, 150L, 242L), `6_2020` = c(233L,
0L, 0L, 0L, 101L, 308L, 112L, 128L, 0L, 201L), `7_2019` = c(371L,
0L, 0L, 46L, 28L, 32L, 0L, 111L, 0L, 333L), `7_2020` = c(370L,
16L, 0L, 32L, 21L, 159L, 0L, 199L, 43L, 15L), `8_2019` = c(338L,
0L, 0L, 0L, 98L, 210L, 0L, 56L, 0L, 212L), `8_2020` = c(334L,
0L, 0L, 47L, 11L, 168L, 0L, 207L, 0L, 40L), `9_2019` = c(409L,
0L, 0L, 18L, 109L, 0L, 0L, 73L, 0L, 387L), `9_2020` = c(406L,
91L, 5L, 0L, 40L, 164L, 23L, 183L, 65L, 28L), `10_2019` = c(330L,
35L, 0L, 0L, 63L, 119L, 0L, 192L, 34L, 88L), `10_2020` = c(331L,
0L, 0L, 0L, 25L, 185L, 71L, 451L, 43L, 29L), `11_2019` = c(448L,
33L, 0L, 0L, 131L, 108L, 0L, 41L, 35L, 196L), `11_2020` = c(559L,
0L, 0L, 44L, 83L, 86L, 49L, 803L, 0L, 132L), `12_2019` = c(300L,
0L, 4L, 0L, 72L, 167L, 0L, 155L, 68L, 93L), `12_2020` = c(122L,
0L, 0L, 0L, 0L, 54L, 0L, 790L, 0L, 1L)), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -10L), groups = structure(list(
Full.Name = c("A. Patrick Beharelle", "Aaron P. Graft", "Aaron P. Jagdfeld",
"Adam H. Schechter", "Adam P. Symson", "Adena T. Friedman",
"Alan David Schnitzer", "Albert Bourla, D.V.M., DVM, Ph.D.",
"Andres Ricardo Gluski Weilert", "Andrew Anagnost"), .rows = structure(list(
1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -10L), .drop = TRUE))
以“John J. Legere”为例
我会为此使用基础 R。此方法仅在结果中包含在两个数据框中都匹配的名称。示例数据中,John Legere是唯一匹配的名字,df2
中对应的值全部为0,R除以0时输出Inf
。
如果有问题需要更多帮助,我建议更新示例数据,以便有更多匹配项,并且有一些非 0 值可以除以。
cols_to_divide = setdiff(intersect(names(df1), names(df2)), "Full.Name")
matching_names = intersect(df1$Full.Name, df2$Full.Name)
# initialize the results with the numerator
df3 = df1[df1$Full.Name %in% matching_names, c("Full.Name", "authority_dic", cols_to_divide)]
# find the right row numbers for the denominator
rows_to_divide = match(df3$Full.Name, df2$Full.Name)
# divide
df3[cols_to_divide] = df3[cols_to_divide] / df2[rows_to_divide, cols_to_divide]
df3
# # A tibble: 2 × 26
# # Groups: authority_dic, Full.Name [2]
# Full.Name authority_dic `4_2020` `3_2020` `7_2020` `12_2019` `8_2020` `5_2019` `1_2019`
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 John J. Legere know Inf Inf Inf Inf Inf Inf Inf
# 2 John J. Legere pro Inf Inf Inf Inf Inf Inf Inf
# # … with 17 more variables: 4_2019 <dbl>, 6_2019 <dbl>, 8_2019 <dbl>, 9_2020 <dbl>,
# # 10_2019 <dbl>, 11_2020 <dbl>, 9_2019 <dbl>, 1_2020 <dbl>, 11_2019 <dbl>, 10_2020 <dbl>,
# # 2_2020 <dbl>, 2_2019 <dbl>, 3_2019 <dbl>, 12_2020 <dbl>, 7_2019 <dbl>, 6_2020 <dbl>,
# # 5_2020 <dbl>
像往常一样,处理宽数据很难,但处理长数据很容易:
inner_join(
pivot_longer(df1, cols=4:27, names_to="period") %>% group_by(Full.Name, period) %>% summarize(value=sum(value,na.rm=T), .groups="drop"),
pivot_longer(df2, cols=2:25, names_to="period") %>% group_by(Full.Name, period) %>% summarize(value=sum(value,na.rm=T), .groups="drop"),
by=c("Full.Name", "period")
) %>%
mutate(result = value.x/value.y) %>%
pivot_wider(id_cols = Full.Name, names_from="period", values_from="result")
如果您想保留 df1
中未出现在 df2
left_join()
输出:
# A tibble: 1 × 25
Full.Name `1_2019` `1_2020` `10_2019` `10_2020` `11_2019` `11_2020` `12_2019` `12_2020` `2_2019` `2_2020` `3_2019` `3_2020` `4_2019` `4_2020` `5_2019` `5_2020`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 John J. Legere Inf Inf Inf 0.0845 Inf 0.0816 Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf
# … with 8 more variables: `6_2019` <dbl>, `6_2020` <dbl>, `7_2019` <dbl>, `7_2020` <dbl>, `8_2019` <dbl>, `8_2020` <dbl>, `9_2019` <dbl>, `9_2020` <dbl>
请注意,如果该帧中每个 Full.Name 只有一行,则旋转 df2 后的 group_by/summarize 可以被删除。另外,请注意使用 na.rm=T 求和,这可能是您想要的,但也会在 value.y 中产生零,导致很多除以零。