R - replacing values in one df by group and column with repeating IDs based on conditions from another df
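I'm a bit stuck. I have 2 data frames - df1 has unique station IDs, % values by month, and the number of times (years) each station appears in the data; df2 has station IDs repeated by year, with the monthly values for each year.
df1: percentage of non-missing monthly temperature data for each station; n is the number of years recorded for that station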
station_ID Jan Feb Mar ... Dec n
10160355 37 39 38 39 141
10160360 94 91 98 89 56
10160390 83 87 85 82 163
df2: monthly temperature data for each station by year; n from df1 is the number of times each station_ID is repeated in df2
station_ID year Jan Feb Mar ... Dec
10160355 1878 NA 10 12 12
10160355 1879 12 12 13 10
...
10160355 2018 14 11 15 14
10160360 1963 12 10 12 14
10160360 1964 10 12 15 11
...
(repeats for all stations & total rows = 277604)
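What I need: for each monthly column, if the df1 value for a station is < 50%, replace the data in df2 with NA for all rows of that station/month - otherwise, leave df2 as is. So, since df1$station_ID[1] shows only 37% for January, all Januaries for that station (df2$station[1:141]) become NA.
Example output I need: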
station_ID year Jan Feb ... Dec
10160355 1878 NA NA NA
10160355 1879 NA NA NA
...
10160360 1963 12 10 14
10160360 1964 10 12 11
...
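I've tried about 20 different ways, but I think I need some form of dplyr with rep to repeat NAs for each station's rows when the condition is true.
Latest attempt, only one month at a time because I couldn't figure out how to do all the columns: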
df3 = df2 %>%
group_by(station_ID) %>%
select(Jan) %>%
mutate(if_else(df1$Jan < 50, rep(NA_character_, df1$n), Jan))
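This gives an error from rep about an invalid 'times' argument. I think I might be close, but I'd appreciate any suggestions! Thanks!
Things like this are much easier to do in "long" format, especially with dplyr: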
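library(dplyr)
library(tidyr)
# reshape both data frames to one row per station/month (and year, for df2)
df1_long = pivot_longer(df1, cols = Jan:Dec, names_to = "month", values_to = "non_missing")
df2_long = pivot_longer(df2, cols = Jan:Dec, names_to = "month", values_to = "temp")
result_long = df2_long %>%
  # attach each station/month percentage to every matching year
  left_join(df1_long, by = c("station_ID", "month")) %>%
  # blank out temperatures where fewer than 50% of values are non-missing
  mutate(temp = ifelse(non_missing < 50, NA, temp))
result_long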
# # A tibble: 20 x 6
# station_ID year month temp n non_missing
# <int> <int> <chr> <int> <int> <int>
# 1 10160355 1878 Jan NA 141 37
# 2 10160355 1878 Feb NA 141 39
# 3 10160355 1878 Mar NA 141 38
# 4 10160355 1878 Dec NA 141 39
# 5 10160355 1879 Jan NA 141 37
# 6 10160355 1879 Feb NA 141 39
# 7 10160355 1879 Mar NA 141 38
# 8 10160355 1879 Dec NA 141 39
# 9 10160355 2018 Jan NA 141 37
# 10 10160355 2018 Feb NA 141 39
# 11 10160355 2018 Mar NA 141 38
# 12 10160355 2018 Dec NA 141 39
# 13 10160360 1963 Jan 12 56 94
# 14 10160360 1963 Feb 10 56 91
# 15 10160360 1963 Mar 12 56 98
# 16 10160360 1963 Dec 14 56 89
# 17 10160360 1964 Jan 10 56 94
# 18 10160360 1964 Feb 12 56 91
# 19 10160360 1964 Mar 15 56 98
# 20 10160360 1964 Dec 11 56 89
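One thing to watch with the join above: if a station/month in df2 has no match in df1, non_missing comes back as NA and ifelse() then returns NA for that temperature too. Below is a minimal sketch of one way to check for that and to leave unmatched rows unchanged, assuming you would rather keep their data. The name unmatched and the use of replace() are my own choices for illustration, not part of the original answer.

```r
library(dplyr)
library(tidyr)

# same sample data as in this answer
df1 <- read.table(text = 'station_ID Jan Feb Mar Dec n
10160355 37 39 38 39 141
10160360 94 91 98 89 56
10160390 83 87 85 82 163', header = TRUE)
df2 <- read.table(text = 'station_ID year Jan Feb Mar Dec
10160355 1878 NA 10 12 12
10160355 1879 12 12 13 10
10160355 2018 14 11 15 14
10160360 1963 12 10 12 14
10160360 1964 10 12 15 11', header = TRUE)

df1_long <- pivot_longer(df1, cols = Jan:Dec, names_to = "month", values_to = "non_missing")
df2_long <- pivot_longer(df2, cols = Jan:Dec, names_to = "month", values_to = "temp")

# station/month rows of df2 that have no counterpart in df1
unmatched <- anti_join(df2_long, df1_long, by = c("station_ID", "month"))
nrow(unmatched)

result_long <- df2_long %>%
  left_join(df1_long, by = c("station_ID", "month")) %>%
  # only blank rows with a known percentage below 50;
  # rows with no match in df1 (non_missing is NA) keep their temperature
  mutate(temp = replace(temp, !is.na(non_missing) & non_missing < 50, NA))
```

With the sample data every station in df2 also appears in df1, so both versions give the same result. The difference only matters if the full data has stations missing from df1.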
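In many cases (especially making plots, but also modeling), I'd recommend sticking with the long-format data. However, it can be converted back to your original wide format:
result_wide = result_long %>%
  select(-n, -non_missing) %>%
  pivot_wider(names_from = "month", values_from = "temp")
result_wide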
# # A tibble: 5 x 6
# station_ID year Jan Feb Mar Dec
# <int> <int> <int> <int> <int> <int>
# 1 10160355 1878 NA NA NA NA
# 2 10160355 1879 NA NA NA NA
# 3 10160355 2018 NA NA NA NA
# 4 10160360 1963 12 10 12 14
# 5 10160360 1964 10 12 15 11
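If you would rather stay in wide format the whole way, here is a rough base R sketch of the same idea. It is an alternative I'm adding for comparison, not the approach above: it looks up each df2 row's station in df1 with match() and blanks one month column at a time. The names months, idx, low and result are made up for the example, and the month vector only lists the four columns in the sample data.

```r
# same sample data as in this answer
df1 <- read.table(text = 'station_ID Jan Feb Mar Dec n
10160355 37 39 38 39 141
10160360 94 91 98 89 56
10160390 83 87 85 82 163', header = TRUE)
df2 <- read.table(text = 'station_ID year Jan Feb Mar Dec
10160355 1878 NA 10 12 12
10160355 1879 12 12 13 10
10160355 2018 14 11 15 14
10160360 1963 12 10 12 14
10160360 1964 10 12 15 11', header = TRUE)

# month columns present in the sample; extend to all twelve for real data
months <- c("Jan", "Feb", "Mar", "Dec")

# for every row of df2, the row of df1 holding that station's percentages
idx <- match(df2$station_ID, df1$station_ID)

result <- df2
for (m in months) {
  # TRUE where this row's station has < 50% non-missing data for month m
  low <- df1[[m]][idx] < 50
  # which() drops NAs, so stations absent from df1 are left unchanged
  result[[m]][which(low)] <- NA
}
result
```

This avoids the reshape and the join, which may help with a few hundred thousand rows, but you lose the long-format table that is handy for plotting.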
Using this data:
df1 = read.table(text = 'station_ID Jan Feb Mar Dec n
10160355 37 39 38 39 141
10160360 94 91 98 89 56
10160390 83 87 85 82 163', header = T)
df2 = read.table(text = 'station_ID year Jan Feb Mar Dec
10160355 1878 NA 10 12 12
10160355 1879 12 12 13 10
10160355 2018 14 11 15 14
10160360 1963 12 10 12 14
10160360 1964 10 12 15 11', header = T)
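As for the error in the original attempt, my reading is that it comes from passing the whole df1$n column to rep(). A small sketch of that, assuming n holds one value per station as in the sample df1 (the variable names n and msg are just for the demonstration):

```r
# n mirrors df1$n from the sample data: one entry per station
n <- c(141, 56, 163)

# rep() wants `times` to be length 1 or the same length as x;
# here x has length 1 and `times` has length 3, so rep() raises an error
msg <- tryCatch(
  rep(NA_character_, n),
  error = function(e) conditionMessage(e)
)
print(msg)

# repeating one NA for a single station is fine, since `times` has length 1
length(rep(NA_character_, n[1]))
```

Inside a grouped mutate(), df1$Jan and df1$n still refer to the full df1 columns, not the current station's values, which is why the join is the easier route.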