R - replacing values in 1 df by group and column with repeating IDs based on conditions from another df

Question

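I'm a bit stuck. I have 2 data frames: df1 has unique station IDs, % values by month, and the number of times (years) each station occurs in the data; df2 has the station IDs repeated by year, with monthly values for each year.

df1: percentage of non-missing monthly temperature data for each station; n is the number of years on record for that station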

station_ID  Jan  Feb  Mar ... Dec  n
10160355    37   39   38      39   141
10160360    94   91   98      89   56
10160390    83   87   85      82   163

df2: monthly temperature data by station and year; n from df1 is the number of times each station_ID is repeated in df2

station_ID  year  Jan  Feb  Mar ... Dec
10160355    1878  NA   10   12      12
10160355    1879  12   12   13      10
...
10160355    2018  14   11   15      14
10160360    1963  12   10   12      14
10160360    1964  10   12   15      11
...
(repeats for all stations & total rows = 277604)

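What I need: for each monthly column, if df1$station < 50%, replace the data in df2 with NA for all rows of that station/month; otherwise, leave df2 as is. So, since df1$station_ID[1] shows only 37% for Jan, all Januaries for that station (df2$station[1:141]) become NA.

Example output I need: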

station_ID   year  Jan  Feb ...  Dec
10160355     1878  NA   NA       NA
10160355     1879  NA   NA       NA
...
10160360     1963  12   10       14
10160360     1964  10   12       11
...

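I've tried about 20 different approaches, but I think I need some form of dplyr with rep, to repeat NA across each station's rows when the condition is true.

Most recent attempt, one month at a time because I can't figure out how to do all the columns: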

df3 =  df2 %>%
    group_by(station_ID) %>%
    select(Jan) %>%
    mutate(if_else(df1$Jan < 50, rep(NA_character_, df1$n), Jan))

This gives an error about an invalid 'times' argument to rep. I think I may be close, but I'd appreciate any suggestions! Thanks!

Answer 1

Things like this are much easier to do in "long" format, especially in dplyr:

library(dplyr)
library(tidyr)

df1_long = pivot_longer(df1, cols = Jan:Dec, names_to = "month", values_to = "non_missing")
df2_long = pivot_longer(df2, cols = Jan:Dec, names_to = "month", values_to = "temp")

result_long = df2_long %>%
  left_join(df1_long, by = c("station_ID", "month")) %>%
  mutate(temp = ifelse(non_missing < 50, NA, temp))

result_long
# # A tibble: 20 x 6
#    station_ID  year month  temp     n non_missing
#         <int> <int> <chr> <int> <int>       <int>
#  1   10160355  1878 Jan      NA   141          37
#  2   10160355  1878 Feb      NA   141          39
#  3   10160355  1878 Mar      NA   141          38
#  4   10160355  1878 Dec      NA   141          39
#  5   10160355  1879 Jan      NA   141          37
#  6   10160355  1879 Feb      NA   141          39
#  7   10160355  1879 Mar      NA   141          38
#  8   10160355  1879 Dec      NA   141          39
#  9   10160355  2018 Jan      NA   141          37
# 10   10160355  2018 Feb      NA   141          39
# 11   10160355  2018 Mar      NA   141          38
# 12   10160355  2018 Dec      NA   141          39
# 13   10160360  1963 Jan      12    56          94
# 14   10160360  1963 Feb      10    56          91
# 15   10160360  1963 Mar      12    56          98
# 16   10160360  1963 Dec      14    56          89
# 17   10160360  1964 Jan      10    56          94
# 18   10160360  1964 Feb      12    56          91
# 19   10160360  1964 Mar      15    56          98
# 20   10160360  1964 Dec      11    56          89

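In many cases (especially making plots, but also modeling), I'd recommend sticking with this long-format data. However, it can be converted back to your original wide format: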

result_wide = result_long %>%
  select(-n, -non_missing) %>%
  pivot_wider(names_from = "month", values_from = "temp")
result_wide
# # A tibble: 5 x 6
#   station_ID  year   Jan   Feb   Mar   Dec
#        <int> <int> <int> <int> <int> <int>
# 1   10160355  1878    NA    NA    NA    NA
# 2   10160355  1879    NA    NA    NA    NA
# 3   10160355  2018    NA    NA    NA    NA
# 4   10160360  1963    12    10    12    14
# 5   10160360  1964    10    12    15    11

Using this data:

df1 = read.table(text = 'station_ID  Jan  Feb  Mar  Dec  n
10160355    37   39   38      39   141
10160360    94   91   98      89   56
10160390    83   87   85      82   163', header = TRUE)

df2 = read.table(text = 'station_ID  year  Jan  Feb  Mar  Dec
10160355    1878  NA   10   12      12
10160355    1879  12   12   13      10
10160355    2018  14   11   15      14
10160360    1963  12   10   12      14
10160360    1964  10   12   15      11', header = TRUE)
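
If you'd rather not reshape at all, the same masking can probably be done directly on the wide data in base R. The following is only a sketch, not a tested solution for the full 277604-row data: it rebuilds the small sample data so it runs on its own, and the names months, idx, low and masked are made up for illustration. It assumes the month columns have the same names in both data frames.

```r
# Sample data from above, rebuilt so this block runs on its own
df1 = data.frame(
  station_ID = c(10160355, 10160360, 10160390),
  Jan = c(37, 94, 83), Feb = c(39, 91, 87),
  Mar = c(38, 98, 85), Dec = c(39, 89, 82),
  n = c(141, 56, 163)
)
df2 = data.frame(
  station_ID = c(10160355, 10160355, 10160355, 10160360, 10160360),
  year = c(1878, 1879, 2018, 1963, 1964),
  Jan = c(NA, 12, 14, 12, 10), Feb = c(10, 12, 11, 10, 12),
  Mar = c(12, 13, 15, 12, 15), Dec = c(12, 10, 14, 14, 11)
)

# Month columns present in both data frames (everything shared except the ID)
months = setdiff(intersect(names(df1), names(df2)), "station_ID")

# For each row of df2, the row of df1 that holds its station
idx = match(df2$station_ID, df1$station_ID)

# Logical matrix, one row per df2 row: TRUE where that station's
# non-missing percentage for the month is below 50
low = as.matrix(df1[idx, months]) < 50
low[is.na(low)] = FALSE  # stations absent from df1 are left untouched

# Set the flagged cells to NA, leaving everything else as is
masked = df2
masked[months][low] = NA
masked
```

Here match() does the job that rep was meant to do in the question: it expands the one-row-per-station percentages to one row per station/year, so no explicit repeat counts (and no n column) are needed.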
