R: Why is merge dropping data? How to interpolate missing values for a merge
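I am trying to merge two relatively large datasets. I am merging on SiteID, which is a unique identifier for the location, and on date/time, which is made up of Year, Month (Mo), Day, and Hour (Hr).
The problem is that merge is dropping data somewhere. The minimum, maximum, mean, and median values all change, when they should stay the same because the data are identical and are only being merged. I have converted the key columns to character and checked that the strings match, but I am still losing data. I have also tried left_join, but that did not seem to help. Details are below.
Edit: merge was dropping data because data did not exist for every ("SiteID", "Year", "Mo", "Day", "Hr") combination. So I needed to interpolate the missing values in dB before merging (see the answer below).
End edit
See the link at the bottom of the page to reproduce this example.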
PC17$Mo<-as.character(PC17$Mo)
PC17$Year<-as.character(PC17$Year)
PC17$Day<-as.character(PC17$Day)
PC17$Hr<-as.character(PC17$Hr)
PC17$SiteID<-as.character(PC17$SiteID)
dB$Mo<-as.character(dB$Mo)
dB$Year<-as.character(dB$Year)
dB$Day<-as.character(dB$Day)
dB$Hr<-as.character(dB$Hr)
dB$SiteID<-as.character(dB$SiteID)
# confirm that data are stored as characters
str(PC17)
str(dB)
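Now, to compare my SiteID values, I use unique to see which strings I have and setdiff to see whether R recognizes any as missing. One SiteID is missing from each dataset, but this is fine because it is truly missing data (not a string problem).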
sort(unique(PC17$SiteID))
sort(unique(dB$SiteID))
setdiff(PC17$SiteID, dB$SiteID) ## TR2U is the only one missing, this is ok
setdiff(dB$SiteID, PC17$SiteID) ## FI7D is the only one missing, this is ok
Now when I look at the data (summarized by SiteID), it looks like a nice, full data frame, meaning I have data for every site that I should.
library(dplyr)
dB %>%
group_by(SiteID) %>%
summarise(
min_dBL50=min(dbAL050, na.rm=TRUE),
max_dBL50=max(dbAL050, na.rm=TRUE),
mean_dBL50=mean(dbAL050, na.rm=TRUE),
med_dBL50=median(dbAL050, na.rm=TRUE)
)
# A tibble: 59 x 5
SiteID min_dBL50 max_dBL50 mean_dBL50 med_dBL50
<chr> <dbl> <dbl> <dbl> <dbl>
1 CU1D 35.3 57.3 47.0 47.6
2 CU1M 33.7 66.8 58.6 60.8
3 CU1U 31.4 55.9 43.1 43.3
4 CU2D 40 58.3 45.3 45.2
5 CU2M 32.4 55.8 41.6 41.3
6 CU2U 31.4 58.1 43.9 42.6
7 CU3D 40.6 59.5 48.4 48.5
8 CU3M 35.8 75.5 65.9 69.3
9 CU3U 40.9 59.2 46.6 46.2
10 CU4D 36.6 49.1 43.6 43.4
# ... with 49 more rows
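Here I merge the two datasets, PC17 and dB, by "SiteID", "Year", "Mo", "Day", "Hr", keeping all PC17 values (even those with no corresponding dB value; all.x=TRUE).
However, when I look at the summary of the merged data, every SiteID now has different values, and some sites are missing completely, such as "CU3D" and "CU4D".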
PCdB<-(merge(PC17, dB, by=c("SiteID", "Year","Mo","Day", "Hr"), all.x=TRUE))
PCdB %>%
group_by(SiteID) %>%
summarise(
min_dBL50=min(dbAL050, na.rm=TRUE),
max_dBL50=max(dbAL050, na.rm=TRUE),
mean_dBL50=mean(dbAL050, na.rm=TRUE),
med_dBL50=median(dbAL050, na.rm=TRUE)
)
# A tibble: 59 x 5
SiteID min_dBL50 max_dBL50 mean_dBL50 med_dBL50
<chr> <dbl> <dbl> <dbl> <dbl>
1 CU1D 47.2 54 52.3 54
2 CU1M 35.4 63 49.2 49.2
3 CU1U 35.3 35.3 35.3 35.3
4 CU2D 42.3 42.3 42.3 42.3
5 CU2M 43.1 43.2 43.1 43.1
6 CU2U 43.7 43.7 43.7 43.7
7 CU3D Inf -Inf NaN NA
8 CU3M 44.1 71.2 57.6 57.6
9 CU3U 45 45 45 45
10 CU4D Inf -Inf NaN NA
# ... with 49 more rows
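I set everything to character with as.character() in the first lines. In addition, I checked Year, Day, Mo, and Hr with setdiff and unique, just as I did above with SiteID, and there do not appear to be any mismatched strings.
I have also tried the dplyr function left_join to merge the datasets, and it made no difference.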
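Checking each key column on its own can look clean even when the combined key does not match. A minimal sketch, using made-up toy data frames (PC17_toy and dB_toy are hypothetical stand-ins, not the real files), that compares the five key columns as one combined string per row:

```r
# Toy stand-ins for PC17 and dB (hypothetical values, not the real data)
PC17_toy <- data.frame(SiteID = c("CU1D", "CU1D", "CU3D"),
                       Year = "2017", Mo = "5", Day = c("1", "2", "1"),
                       Hr = c("10", "10", "10"),
                       stringsAsFactors = FALSE)
dB_toy <- data.frame(SiteID = c("CU1D", "CU3D"),
                     Year = "2017", Mo = "5", Day = c("1", "2"),
                     Hr = c("10", "10"), dbAL050 = c(47.2, 48.5),
                     stringsAsFactors = FALSE)
keys <- c("SiteID", "Year", "Mo", "Day", "Hr")

# Column by column nothing looks wrong: every value in PC17_toy
# also occurs somewhere in dB_toy
per_column <- sapply(keys, function(k) length(setdiff(PC17_toy[[k]], dB_toy[[k]])))

# Paste the key columns into one string per row and compare those instead
make_key <- function(d) do.call(paste, c(d[keys], sep = "_"))
unmatched <- PC17_toy[!make_key(PC17_toy) %in% make_key(dB_toy), ]
unmatched  # rows of PC17_toy that have no partner row in dB_toy
```

Rows listed in unmatched are the ones that end up with NA for every dB column after a left join.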
Probably solved when you use na.rm = TRUE in your summarise functions...
A data.table approach:
library( data.table )
dt.PC17 <- fread( "./PC_SO.csv" )
dt.dB <- fread( "./dB.csv" )
#data.table left join on "SiteID", "Year","Mo","Day", "Hr", and the summarise...
dt.PCdB <- dt.dB[ dt.PC17, on = .( SiteID, Year, Mo, Day, Hr ) ]
#summarise, and order by SiteID
result <- setorder( dt.PCdB[, list(min_dBL50 = min( dbAL050, na.rm = TRUE ),
max_dBL50 = max( dbAL050, na.rm = TRUE ),
mean_dBL50 = mean( dbAL050, na.rm = TRUE ),
med_dBL50 = median( dbAL050, na.rm = TRUE )
),
by = "SiteID" ],
SiteID)
head( result, 10 )
# SiteID min_dBL50 max_dBL50 mean_dBL50 med_dBL50
# 1: CU1D 47.2 54.0 52.300 54.00
# 2: CU1M 35.4 63.0 49.200 49.20
# 3: CU1U 35.3 35.3 35.300 35.30
# 4: CU2D 42.3 42.3 42.300 42.30
# 5: CU2M 43.1 43.2 43.125 43.10
# 6: CU2U 43.7 43.7 43.700 43.70
# 7: CU3D Inf -Inf NaN NA
# 8: CU3M 44.1 71.2 57.650 57.65
# 9: CU3U 45.0 45.0 45.000 45.00
# 10: CU4D Inf -Inf NaN NA
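If you want to perform the join but exclude rows with no match (so that you do not get rows like the one for "CU3D" above), use nomatch = 0L, which effectively turns the left join into an inner join: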
dt.PCdB <- dt.dB[ dt.PC17, on = .( SiteID, Year, Mo, Day, Hr ), nomatch = 0L ]
This results in:
# SiteID min_dBL50 max_dBL50 mean_dBL50 med_dBL50
# 1: CU1D 47.2 54.0 52.300 54.00
# 2: CU1M 35.4 63.0 49.200 49.20
# 3: CU1U 35.3 35.3 35.300 35.30
# 4: CU2D 42.3 42.3 42.300 42.30
# 5: CU2M 43.1 43.2 43.125 43.10
# 6: CU2U 43.7 43.7 43.700 43.70
# 7: CU3M 44.1 71.2 57.650 57.65
# 8: CU3U 45.0 45.0 45.000 45.00
# 9: CU4M 52.4 55.9 54.150 54.15
# 10: CU4U 51.3 51.3 51.300 51.30
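For comparison with the merge() call in the question, the same distinction exists in base R: all.x = TRUE keeps unmatched rows, while the default drops them. A small sketch on made-up data (pc and db are hypothetical toy frames, and only two key columns are used for brevity):

```r
# Hypothetical toy data: CU3D has no matching row in db
pc <- data.frame(SiteID = c("CU1D", "CU3D"), Hr = c(10, 10), count = c(3, 5),
                 stringsAsFactors = FALSE)
db <- data.frame(SiteID = "CU1D", Hr = 10, dbAL050 = 47.2,
                 stringsAsFactors = FALSE)

# Left join: CU3D is kept, with NA in dbAL050
left_joined <- merge(pc, db, by = c("SiteID", "Hr"), all.x = TRUE)

# Inner join (the default): CU3D is dropped, like nomatch = 0L in data.table
inner_joined <- merge(pc, db, by = c("SiteID", "Hr"))
```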
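To finally answer this question, I had to understand the data better. The merge function itself was not dropping any values; it was doing exactly what it was told. However, because the datasets were merged by SiteID, Year, Mo, Day, Hr, the result was Inf, NaN, and NA values for a few SiteIDs.
The reason is that dB is not a fully continuous dataset to merge with. Inf, NaN, and NA values were returned for some SiteIDs because the data did not overlap in all of the variables (SiteID, Year, Mo, Day, Hr).
So I solved this problem with interpolation. That is, I filled in the missing values based on the values from the dates on either side of them. The imputeTS package was valuable here.
So I first interpolated the missing values between the dates that have data, and then re-merged the datasets.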
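To see where those Inf, -Inf, NaN, and NA summary values come from, here is a short sketch with a made-up vector standing in for a site whose dbAL050 values are all NA after the left join. With na.rm = TRUE the summary functions are applied to an empty vector:

```r
# Hypothetical site with no matching dB rows: every value is NA after the merge
x <- c(NA_real_, NA_real_)

site_min  <- suppressWarnings(min(x, na.rm = TRUE))  # Inf (R also warns about no non-missing arguments)
site_max  <- suppressWarnings(max(x, na.rm = TRUE))  # -Inf
site_mean <- mean(x, na.rm = TRUE)                   # NaN
site_med  <- median(x, na.rm = TRUE)                 # NA

# Counting non-missing values per site makes the empty groups easy to spot
n_obs <- sum(!is.na(x))
```

Adding a count like n_obs to the summarise call is one way to tell "no data at all" apart from real values.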
library(imputeTS)
library(tidyverse)
### We want to first interpolate dB values on the siteID first in dB dataset, BEFORE merging.
### Why? Because the merge drops all the data that would help with the interpolation!!
dB<-read.csv("dB.csv")
dB_clean <- dB %>%
mutate_if(is.integer, as.character)
# Create a wide table with spots for each minute. Missing will
# show up as NA's
# All the NA's here in the columns represent
# missing jDays that we should add. jDay is an integer date 'julian date'
dB_NA_find <- dB_clean %>%
count(SiteID, jDay) %>%
spread(jDay, n)
dB_NA_find
# A tibble: 59 x 88
# SiteID `13633` `13634` `13635` `13636` `13637` `13638` `13639` `13640` `13641`
# <fct> <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1 CU1D NA NA NA NA NA NA NA NA
# 2 CU1M NA 11 24 24 24 24 24 24
# 3 CU1U NA 11 24 24 24 24 24 24
# 4 CU2D NA NA NA NA NA NA NA NA
# 5 CU2M NA 9 24 24 24 24 24 24
# 6 CU2U NA 9 24 24 24 24 21 NA
# 7 CU3D NA NA NA NA NA NA NA NA
# 8 CU3M NA NA NA NA NA NA NA NA
# 9 CU3U NA NA NA NA NA NA NA NA
# 10 CU4D NA NA NA NA NA NA NA NA
# Take the NA minute entries and make the desired line for each
dB_rows_to_add <- dB_NA_find %>%
gather(jDay, count, 2:88) %>%
filter(is.na(count)) %>%
select(-count, -NA)
# Add these lines to the original, remove the NA jDay rows
# (these have been replaced with jDay rows), and sort
dB <- dB_clean %>%
bind_rows(dB_rows_to_add) %>%
filter(jDay != "NA") %>%
arrange(SiteID, jDay)
length((dB$DailyL50.x[is.na(dB$DailyL50.x)])) ## How many NAs do I have?
# [1] 3030
## Here is where we do the na.interpolation with package imputeTS
## (imputeTS 3.0 and later name this function na_interpolation())
# prime the for loop with zeros
D<-rep("0",17)
sites<-unique(dB$SiteID)
for(i in 1:length(sites)){
temp<-dB[dB$SiteID==sites[i], ]
temp<-temp[order(temp$jDay),]
temp$DayL50<-na.interpolation(temp$DailyL50.x, option="spline")
D<-rbind(D, temp)
}
# delete the first row of zeros from above 'priming'
dBN<-D[-1,]
length((dBN$DayL50[is.na(dBN$DayL50)])) ## How many NAs do I have?
# [1] 0
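The same two steps (add the missing SiteID and jDay rows, then interpolate within each site) can also be sketched in base R without priming a loop. This is only an illustration on made-up data: dB_toy and fill_site are hypothetical names, and it swaps in linear interpolation via approx() for the spline option of imputeTS used above.

```r
# Toy daily data with gaps (hypothetical values)
dB_toy <- data.frame(SiteID = c("A", "A", "A", "B", "B"),
                     jDay = c(1, 2, 4, 1, 4),
                     DailyL50 = c(40, 42, 46, 50, 56),
                     stringsAsFactors = FALSE)

# 1. Build every SiteID x jDay combination and left-join the data onto it,
#    so the missing days show up as rows with NA
grid <- expand.grid(SiteID = unique(dB_toy$SiteID),
                    jDay = seq(min(dB_toy$jDay), max(dB_toy$jDay)),
                    stringsAsFactors = FALSE)
full <- merge(grid, dB_toy, by = c("SiteID", "jDay"), all.x = TRUE)

# 2. Interpolate within one site, ordered by jDay.
#    approx() is linear and leaves gaps before the first or after the
#    last observed day as NA.
fill_site <- function(d) {
  d <- d[order(d$jDay), ]
  known <- !is.na(d$DailyL50)
  d$DayL50 <- approx(d$jDay[known], d$DailyL50[known], xout = d$jDay)$y
  d
}

# Apply per site and stack the pieces back together
filled <- do.call(rbind, lapply(split(full, full$SiteID), fill_site))
rownames(filled) <- NULL
```

Each site needs at least two observed days for approx() to work, so sites with less data would have to be handled separately.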
Because I interpolated the NAs above based on jDay, I am missing the month (Mo), Day, and Year information for those rows.
dBN$Year<-"2017" #all data are from 2017
##I could not figure out how jDay was formatted, so I created a manual 'key'
##to get Mo and Day by counting from a known date/jDay pair in original data
#Example:
# 13635 is Mo=5 Day=1
# 13665 is Mo=5 Day=31
# 13666 is Mo=6 Day=1
# 13695 is Mo=6 Day=30
key4<-data.frame("jDay"=c(13633:13634), "Day"=c(29:30), "Mo"=4)
key5<-data.frame("jDay"=c(13635:13665), "Day"=c(1:31), "Mo"=5)
key6<-data.frame("jDay"=c(13666:13695), "Day"=c(1:30), "Mo"=6)
key7<-data.frame("jDay"=c(13696:13719), "Day"=c(1:24), "Mo"=7)
#make master 'key'
key<-rbind(key4,key5,key6,key7)
# Merge 'key' with dataset so all rows now have 'Mo' and 'Day' values
dBM<-merge(dBN, key, by="jDay", all.x=TRUE)
#clean unnecessary columns and rename 'Mo' and 'Day' so it matches PC17 dataset
dBM<-dBM[ , -c(2,3,6:16)]
colnames(dBM)[5:6]<-c("Day","Mo")
#I noticed an issue with duplication - merge with PC17 created a massive dataframe
dBM %>% ### Have too many observations per day, will duplicate merge out of control.
count(SiteID, jDay, DayL50) %>%
summarise(
min=min(n, na.rm=TRUE),
mean=mean(n, na.rm=TRUE),
max=max(n, na.rm=TRUE)
)
## to fix this I only kept distinct observations so that each day has 1 observation
dB<-distinct(dBM, .keep_all = TRUE)
### Now run above line again to check how many observations per day are left. Should be 1
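As a possible alternative to the manual key, the known jDay and date pairs listed in the comments (13635 is May 1, 13666 is June 1, and so on, all in 2017) are consistent with jDay being a count of days since 1980-01-01. That origin is an inference from those pairs, not something documented for this dataset, so treat it as an assumption and check it against your own known dates:

```r
# Assumption: jDay counts days from 1980-01-01 (inferred, not documented)
jday_origin <- as.Date("1980-01-01")

jDay <- c(13633, 13635, 13665, 13666, 13695, 13719)
dates <- jday_origin + jDay

# Build the lookup table directly from the dates instead of typing it by hand
key <- data.frame(jDay = jDay,
                  Year = format(dates, "%Y"),
                  Mo   = as.integer(format(dates, "%m")),
                  Day  = as.integer(format(dates, "%d")),
                  stringsAsFactors = FALSE)
key
```

If the assumption holds, passing all jDay values through the same conversion would replace key4 through key7.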
Now, when you merge dB and PC17, the interpolated values (which were previously missing NAs) should be included. It looks like this:
> PCdB<-(merge(PC17, dB, by=c("SiteID", "Year","Mo","Day"), all.x=TRUE, all=FALSE,no.dups=TRUE))
> ### all.x=TRUE is important. This keeps all PC17 data, even stuff that DOESNT have dB data that corresponds to it.
> library(dplyr)
#Here is the NA interpolated 'dB' dataset
> dB %>%
+ group_by(SiteID) %>%
+ dplyr::summarise(
+ min_dBL50=min(DayL50, na.rm=TRUE),
+ max_dBL50=max(DayL50, na.rm=TRUE),
+ mean_dBL50=mean(DayL50, na.rm=TRUE),
+ med_dBL50=median(DayL50, na.rm=TRUE)
+ )
# A tibble: 59 x 5
SiteID min_dBL50 max_dBL50 mean_dBL50 med_dBL50
<chr> <dbl> <dbl> <dbl> <dbl>
1 CU1D 44.7 53.1 49.4 50.2
2 CU1M 37.6 65.2 59.5 62.6
3 CU1U 35.5 51 43.7 44.8
4 CU2D 42 52 47.8 49.3
5 CU2M 38.2 49 43.1 42.9
6 CU2U 34.1 53.7 46.5 47
7 CU3D 46.1 53.3 49.7 49.4
8 CU3M 44.5 73.5 61.9 68.2
9 CU3U 42 52.6 47.0 46.8
10 CU4D 42 45.3 44.0 44.6
# ... with 49 more rows
# Now here is the PCdB merged dataset, and we are no longer missing values!
> PCdB %>%
+ group_by(SiteID) %>%
+ dplyr::summarise(
+ min_dBL50=min(DayL50, na.rm=TRUE),
+ max_dBL50=max(DayL50, na.rm=TRUE),
+ mean_dBL50=mean(DayL50, na.rm=TRUE),
+ med_dBL50=median(DayL50, na.rm=TRUE)
+ )
# A tibble: 60 x 5
SiteID min_dBL50 max_dBL50 mean_dBL50 med_dBL50
<chr> <dbl> <dbl> <dbl> <dbl>
1 CU1D 44.8 50 46.8 47
2 CU1M 59 63.9 62.3 62.9
3 CU1U 37.9 46 43.6 44.4
4 CU2D 42.1 51.6 45.6 44.3
5 CU2M 38.4 48.3 44.2 45.5
6 CU2U 39.8 50.7 45.7 46.4
7 CU3D 46.5 49.5 47.7 47.7
8 CU3M 67.7 71.2 69.5 69.4
9 CU3U 43.3 52.6 48.1 48.2
10 CU4D 43.2 45.3 44.4 44.9
# ... with 50 more rows
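Since duplicated days in dB made the merged data frame blow up earlier, a quick guard is to check that the right-hand table has one row per key and that the left join keeps the row count of the left table. A minimal sketch on made-up data (pc, db_dup, and db_one are hypothetical names):

```r
# Hypothetical toy data: db_dup has the same SiteID/Day key twice
pc <- data.frame(SiteID = c("CU1D", "CU1D"), Day = c(1, 2), count = c(3, 5),
                 stringsAsFactors = FALSE)
db_dup <- data.frame(SiteID = "CU1D", Day = c(1, 1, 2), DayL50 = c(44.8, 44.8, 47),
                     stringsAsFactors = FALSE)

# Duplicate keys on the right side multiply rows in the merge
nrow_dup <- nrow(merge(pc, db_dup, by = c("SiteID", "Day"), all.x = TRUE))

# Keep one row per key, then confirm the left join preserves nrow(pc)
db_one <- db_dup[!duplicated(db_dup[c("SiteID", "Day")]), ]
merged <- merge(pc, db_one, by = c("SiteID", "Day"), all.x = TRUE)
stopifnot(nrow(merged) == nrow(pc))
```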