R: Why is merge dropping data? How to interpolate missing values for a merge

I am trying to merge two relatively large datasets. I am merging on SiteID - a unique indicator of location - and on date/time, which is made up of Year, Month=Mo, Day, and Hour=Hr.

The problem is that merge is dropping data somewhere. The minimum, maximum, mean, and median all change, when they should be the same data, just merged. I have converted the data to character and checked that the strings match, but I am still losing data. I have also tried left_join, but that did not seem to help. Details below.

EDIT: The merge is dropping data because data do not exist for every ("SiteID", "Year","Mo","Day", "Hr") combination. So, before merging, I need to interpolate the missing values in dB (see the answer below). END EDIT

See the link at the bottom of the page to reproduce this example.

PC17$Mo<-as.character(PC17$Mo)
PC17$Year<-as.character(PC17$Year)
PC17$Day<-as.character(PC17$Day)
PC17$Hr<-as.character(PC17$Hr)
PC17$SiteID<-as.character(PC17$SiteID)

dB$Mo<-as.character(dB$Mo)
dB$Year<-as.character(dB$Year)
dB$Day<-as.character(dB$Day)
dB$Hr<-as.character(dB$Hr)
dB$SiteID<-as.character(dB$SiteID)

# confirm that data are stored as characters
str(PC17)
str(dB)
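
A more compact equivalent, in case you use dplyr (a sketch; across() requires dplyr >= 1.0):

library(dplyr)

# convert all five join keys to character in one step
key_cols <- c("SiteID", "Year", "Mo", "Day", "Hr")
PC17 <- PC17 %>% mutate(across(all_of(key_cols), as.character))
dB   <- dB   %>% mutate(across(all_of(key_cols), as.character))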

Now, to compare my SiteID values, I use unique to see which strings I have and setdiff to see whether R finds any missing. Each dataset is missing one SiteID that the other has, but that is fine, because those data are genuinely missing (not a string problem).

sort(unique(PC17$SiteID))
sort(unique(dB$SiteID))

setdiff(PC17$SiteID, dB$SiteID)  ## TR2U is the only one missing, this is ok
setdiff(dB$SiteID, PC17$SiteID)  ## FI7D is the only one missing, this is ok
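
The same check can be run over every join key in one pass (a quick sketch):

# print the unmatched values for each join key
for (col in c("SiteID", "Year", "Mo", "Day", "Hr")) {
  cat(col, "- only in PC17:", setdiff(PC17[[col]], dB[[col]]),
      "| only in dB:", setdiff(dB[[col]], PC17[[col]]), "\n")
}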

Now, when I look at the data summarized by SiteID, it looks like a nice, complete data frame - meaning I have data for every site I should have.

library(dplyr)
dB %>% 
  group_by(SiteID) %>% 
  summarise(
    min_dBL50=min(dbAL050, na.rm=TRUE),
    max_dBL50=max(dbAL050, na.rm=TRUE),
    mean_dBL50=mean(dbAL050, na.rm=TRUE),
    med_dBL50=median(dbAL050, na.rm=TRUE)
  )

# A tibble: 59 x 5
   SiteID min_dBL50 max_dBL50 mean_dBL50 med_dBL50
   <chr>      <dbl>     <dbl>      <dbl>     <dbl>
 1 CU1D        35.3      57.3       47.0      47.6
 2 CU1M        33.7      66.8       58.6      60.8
 3 CU1U        31.4      55.9       43.1      43.3
 4 CU2D        40        58.3       45.3      45.2
 5 CU2M        32.4      55.8       41.6      41.3
 6 CU2U        31.4      58.1       43.9      42.6
 7 CU3D        40.6      59.5       48.4      48.5
 8 CU3M        35.8      75.5       65.9      69.3
 9 CU3U        40.9      59.2       46.6      46.2
10 CU4D        36.6      49.1       43.6      43.4
# ... with 49 more rows

Here I merge the two datasets PC17 and dB by "SiteID", "Year","Mo","Day", "Hr" - keeping all PC17 rows, even those without a corresponding dB value (all.x=TRUE).

However, when I look at a summary of these data, every SiteID now has different values, and some sites are missing entirely, such as "CU3D" and "CU4D".

PCdB<-(merge(PC17, dB, by=c("SiteID", "Year","Mo","Day", "Hr"), all.x=TRUE))

PCdB %>% 
  group_by(SiteID) %>% 
  summarise(
    min_dBL50=min(dbAL050, na.rm=TRUE),
    max_dBL50=max(dbAL050, na.rm=TRUE),
    mean_dBL50=mean(dbAL050, na.rm=TRUE),
    med_dBL50=median(dbAL050, na.rm=TRUE)
  )

# A tibble: 59 x 5
   SiteID min_dBL50 max_dBL50 mean_dBL50 med_dBL50
   <chr>      <dbl>     <dbl>      <dbl>     <dbl>
 1 CU1D        47.2      54         52.3      54  
 2 CU1M        35.4      63         49.2      49.2
 3 CU1U        35.3      35.3       35.3      35.3
 4 CU2D        42.3      42.3       42.3      42.3
 5 CU2M        43.1      43.2       43.1      43.1
 6 CU2U        43.7      43.7       43.7      43.7
 7 CU3D       Inf      -Inf        NaN        NA  
 8 CU3M        44.1      71.2       57.6      57.6
 9 CU3U        45        45         45        45  
10 CU4D       Inf      -Inf        NaN        NA  
# ... with 49 more rows
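
One way to see exactly which key combinations fail to match (a diagnostic sketch using dplyr::anti_join):

library(dplyr)

# PC17 rows with no dB observation for the full key; these rows are
# NA-filled by the left merge and later summarise to Inf/-Inf/NaN/NA
no_match <- anti_join(PC17, dB, by=c("SiteID", "Year","Mo","Day", "Hr"))
count(no_match, SiteID, sort=TRUE)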

I set everything to character with as.character() in the first lines. In addition, I checked Year, Day, Mo, and Hr with setdiff and unique, just as I did with SiteID above, and there do not seem to be any problems with non-matching strings.

I also tried the dplyr function left_join to merge the datasets, but it made no difference.

Probably solved when using na.rm = TRUE in your summarise functions...
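
For reference, this is how base R behaves when a group contains only NAs and na.rm = TRUE removes all of them, which is exactly what produces the Inf, -Inf, NaN, and NA rows above:

min(NA_real_, na.rm=TRUE)     # Inf  (with a warning)
max(NA_real_, na.rm=TRUE)     # -Inf (with a warning)
mean(NA_real_, na.rm=TRUE)    # NaN
median(NA_real_, na.rm=TRUE)  # NA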

A data.table approach:

library( data.table )

dt.PC17 <- fread( "./PC_SO.csv" )
dt.dB <- fread( "./dB.csv" )

#data.table left join on "SiteID", "Year","Mo","Day", "Hr", and the summarise...
dt.PCdB <- dt.dB[ dt.PC17, on = .( SiteID, Year, Mo, Day, Hr ) ]
#summarise, and order by SiteID
result <- setorder( dt.PCdB[, list(min_dBL50  = min( dbAL050, na.rm = TRUE ),
                                   max_dBL50  = max( dbAL050, na.rm = TRUE ),
                                   mean_dBL50 = mean( dbAL050, na.rm = TRUE ),
                                   med_dBL50  = median( dbAL050, na.rm = TRUE ) 
                                   ), 
                            by = "SiteID" ], 
                    SiteID)

head( result, 10 )
#     SiteID min_dBL50 max_dBL50 mean_dBL50 med_dBL50
#  1:   CU1D      47.2      54.0     52.300     54.00
#  2:   CU1M      35.4      63.0     49.200     49.20
#  3:   CU1U      35.3      35.3     35.300     35.30
#  4:   CU2D      42.3      42.3     42.300     42.30
#  5:   CU2M      43.1      43.2     43.125     43.10
#  6:   CU2U      43.7      43.7     43.700     43.70
#  7:   CU3D       Inf      -Inf        NaN        NA
#  8:   CU3M      44.1      71.2     57.650     57.65
#  9:   CU3U      45.0      45.0     45.000     45.00
# 10:   CU4D       Inf      -Inf        NaN        NA

If you want to perform a left join but exclude the keys that cannot be matched (so you do not get rows like the one above for "CU3D"), use:

dt.PCdB <- dt.dB[ dt.PC17, on = .( SiteID, Year, Mo, Day, Hr ), nomatch = 0L ]

This results in:

#     SiteID min_dBL50 max_dBL50 mean_dBL50 med_dBL50
#  1:   CU1D      47.2      54.0     52.300     54.00
#  2:   CU1M      35.4      63.0     49.200     49.20
#  3:   CU1U      35.3      35.3     35.300     35.30
#  4:   CU2D      42.3      42.3     42.300     42.30
#  5:   CU2M      43.1      43.2     43.125     43.10
#  6:   CU2U      43.7      43.7     43.700     43.70
#  7:   CU3M      44.1      71.2     57.650     57.65
#  8:   CU3U      45.0      45.0     45.000     45.00
#  9:   CU4M      52.4      55.9     54.150     54.15
# 10:   CU4U      51.3      51.3     51.300     51.30
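
For comparison, roughly equivalent dplyr versions of the two joins above (a sketch, assuming the same character keys):

library(dplyr)

# left join: unmatched PC17 rows are kept and NA-filled (first result)
PCdB_left  <- left_join(PC17, dB, by=c("SiteID", "Year","Mo","Day", "Hr"))
# inner join: unmatched rows are dropped, like nomatch = 0L (second result)
PCdB_inner <- inner_join(PC17, dB, by=c("SiteID", "Year","Mo","Day", "Hr"))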

Finally answering this question with a deeper understanding of the data. The merge function itself does not drop any values; it does exactly what you tell it to. However, because the datasets were merged by SiteID, Year, Mo, Day, Hr, the result had Inf, NaN, and NA values for a few SiteIDs.

The reason is that dB is not a fully continuous dataset that can be merged against directly. Inf, NaN, and NA values were returned for certain SiteIDs because the data did not overlap across all of the variables (SiteID, Year, Mo, Day, Hr).

So I solved this with interpolation. That is, I filled in the missing values based on the date values on either side of them. The imputeTS package was invaluable here.
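
A toy illustration of the idea, with made-up numbers (na.interpolation is spelled na_interpolation in newer imputeTS releases):

library(imputeTS)

# two missing values filled linearly from their neighbours
x <- c(47.2, NA, NA, 50.1, 51.0)
na.interpolation(x, option="linear")
# [1] 47.20000 48.16667 49.13333 50.10000 51.00000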

So I first interpolate the missing values between dates in dB, and then re-merge the datasets.

library(imputeTS)
library(tidyverse)

### We want to interpolate the dB values by SiteID in the dB dataset BEFORE merging.
### Why? Because the merge drops the data that would help with the interpolation!!

dB<-read.csv("dB.csv")

dB_clean <- dB %>%
  mutate_if(is.integer, as.character)

# Create a wide table with a spot for each SiteID/jDay combination.
#   Missing combinations show up as NA's.
# All the NA's here in the columns represent missing jDays
#   that we should add. jDay is an integer date ('julian day').
dB_NA_find <- dB_clean %>%
  count(SiteID, jDay) %>%
  spread(jDay, n)

dB_NA_find
# A tibble: 59 x 88
#    SiteID `13633` `13634` `13635` `13636` `13637` `13638` `13639` `13640`
#    <fct>    <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>
#  1 CU1D        NA      NA      NA      NA      NA      NA      NA      NA
#  2 CU1M        NA      11      24      24      24      24      24      24
#  3 CU1U        NA      11      24      24      24      24      24      24
#  4 CU2D        NA      NA      NA      NA      NA      NA      NA      NA
#  5 CU2M        NA       9      24      24      24      24      24      24
#  6 CU2U        NA       9      24      24      24      24      21      NA
#  7 CU3D        NA      NA      NA      NA      NA      NA      NA      NA
#  8 CU3M        NA      NA      NA      NA      NA      NA      NA      NA
#  9 CU3U        NA      NA      NA      NA      NA      NA      NA      NA
# 10 CU4D        NA      NA      NA      NA      NA      NA      NA      NA
# ... with 49 more rows, and 79 more variables


# Take the NA jDay entries and make the desired line for each
dB_rows_to_add <- dB_NA_find %>%
  gather(jDay, count, 2:88) %>%
  filter(is.na(count)) %>%
  select(-count)   # drop the helper column; the jDay=="NA" rows are removed below

# Add these lines to the original,  remove the NA jDay rows 
#   (these have been replaced with jDay rows), and sort
dB <- dB_clean %>%
  bind_rows(dB_rows_to_add) %>%
  filter(jDay != "NA") %>%
  arrange(SiteID, jDay)
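
For reference, tidyr::complete() can do the same padding in one step, without the spread/gather round trip (a sketch; it assumes jDay spans 13633:13719, as in the manual key further down):

library(tidyr)

# add an (empty) row for every missing SiteID/jDay combination
dB_padded <- dB_clean %>%
  complete(SiteID, jDay=as.character(13633:13719)) %>%
  arrange(SiteID, jDay)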


length((dB$DailyL50.x[is.na(dB$DailyL50.x)])) ## How many NAs do I have?
# [1] 3030

## Here is where we do the na.interpolation with package imputeTS
# prime the for loop with zeros
D<-rep("0",17)
sites<-unique(dB$SiteID)

for(i in 1:length(sites)){
  temp<-dB[dB$SiteID==sites[i], ]
  temp<-temp[order(temp$jDay),]
  temp$DayL50<-na.interpolation(temp$DailyL50.x, option="spline")
  D<-rbind(D, temp)
}

# delete the first row of zeros from above 'priming'
dBN<-D[-1,]

length((dBN$DayL50[is.na(dBN$DayL50)])) ## How many NAs do I have?
# [1] 0
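
The same interpolation can also be done per group, without the loop and the rbind priming (a sketch, assuming every SiteID has at least two non-NA values to interpolate between):

library(dplyr)
library(imputeTS)

dBN <- dB %>%
  arrange(SiteID, jDay) %>%
  group_by(SiteID) %>%
  mutate(DayL50=na.interpolation(DailyL50.x, option="spline")) %>%
  ungroup()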

Because I interpolated the NAs above based on jDay, those rows are missing their month (Mo), Day, and Year information.

dBN$Year<-"2017"  #all data are from 2017

##I could not figure out how jDay was formatted, so I created a manual 'key' 
##to get Mo and Day by counting from a known date/jDay pair in original data

#Example:
# 13635 is Mo=5 Day=1
# 13665 is Mo=5 Day=31
# 13666 is Mo=6 Day=1
# 13695 is Mo=6 Day=30

key4<-data.frame("jDay"=c(13633:13634), "Day"=c(29:30), "Mo"=4)
key5<-data.frame("jDay"=c(13635:13665), "Day"=c(1:31), "Mo"=5)
key6<-data.frame("jDay"=c(13666:13695), "Day"=c(1:30), "Mo"=6)
key7<-data.frame("jDay"=c(13696:13719), "Day"=c(1:24), "Mo"=7)

#make master 'key'
key<-rbind(key4,key5,key6,key7)

# Merge 'key' with dataset so all rows now have 'Mo' and 'Day' values
dBM<-merge(dBN, key, by="jDay", all.x=TRUE)

#clean unnecessary columns and rename 'Mo' and 'Day' so they match the PC17 dataset
dBM<-dBM[ , -c(2,3,6:16)]
colnames(dBM)[5:6]<-c("Day","Mo")
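
As an aside, the key pairs above are consistent with jDay being a count of days since 1980-01-01 (13635 maps to 2017-05-01). If that holds for the whole dataset - verify against your own data first - the key can be built programmatically:

# hypothetical origin: jDay = days since 1980-01-01 (fits the pairs above)
jd <- 13633:13719
key_auto <- data.frame(
  jDay = jd,
  Day  = as.integer(format(as.Date(jd, origin="1980-01-01"), "%d")),
  Mo   = as.integer(format(as.Date(jd, origin="1980-01-01"), "%m"))
)
head(key_auto)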

#I noticed an issue with duplication - merge with PC17 created a massive dataframe
dBM %>%  ### Too many observations per day will make the merge duplicate out of control.
  count(SiteID, jDay, DayL50) %>% 
  summarise(
    min=min(n, na.rm=TRUE),
    mean=mean(n, na.rm=TRUE),
    max=max(n, na.rm=TRUE)
  )

## to fix this I only kept distinct observations so that each day has 1 observation
dB<-distinct(dBM, .keep_all = TRUE)
### Now run above line again to check how many observations per day are left. Should be 1
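
One caveat: distinct() on whole rows only removes exact duplicates. If the duplicated days differ in any other column, restricting the comparison to the keys is closer to 'one observation per day' (a sketch; the first occurrence wins):

dB <- distinct(dBM, SiteID, jDay, .keep_all = TRUE)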

Now, when you merge dB and PC17, the interpolated values (the previously missing NAs) are included. It looks like this:

> PCdB<-(merge(PC17, dB, by=c("SiteID", "Year","Mo","Day"), all.x=TRUE, all=FALSE, no.dups=TRUE))
> ### all.x=TRUE is important. This keeps all PC17 data, even rows that DON'T have corresponding dB data.

> library(dplyr)

#Here is the NA interpolated 'dB' dataset 
> dB %>% 
+   group_by(SiteID) %>% 
+   dplyr::summarise(
+     min_dBL50=min(DayL50, na.rm=TRUE),
+     max_dBL50=max(DayL50, na.rm=TRUE),
+     mean_dBL50=mean(DayL50, na.rm=TRUE),
+     med_dBL50=median(DayL50, na.rm=TRUE)
+   )
# A tibble: 59 x 5
   SiteID min_dBL50 max_dBL50 mean_dBL50 med_dBL50
   <chr>      <dbl>     <dbl>      <dbl>     <dbl>
 1 CU1D        44.7      53.1       49.4      50.2
 2 CU1M        37.6      65.2       59.5      62.6
 3 CU1U        35.5      51         43.7      44.8
 4 CU2D        42        52         47.8      49.3
 5 CU2M        38.2      49         43.1      42.9
 6 CU2U        34.1      53.7       46.5      47  
 7 CU3D        46.1      53.3       49.7      49.4
 8 CU3M        44.5      73.5       61.9      68.2
 9 CU3U        42        52.6       47.0      46.8
10 CU4D        42        45.3       44.0      44.6
# ... with 49 more rows

# Now here is the PCdB merged dataset, and we are no longer missing values!
> PCdB %>% 
+   group_by(SiteID) %>% 
+   dplyr::summarise(
+     min_dBL50=min(DayL50, na.rm=TRUE),
+     max_dBL50=max(DayL50, na.rm=TRUE),
+     mean_dBL50=mean(DayL50, na.rm=TRUE),
+     med_dBL50=median(DayL50, na.rm=TRUE)
+   )
# A tibble: 60 x 5
   SiteID min_dBL50 max_dBL50 mean_dBL50 med_dBL50
   <chr>      <dbl>     <dbl>      <dbl>     <dbl>
 1 CU1D        44.8      50         46.8      47  
 2 CU1M        59        63.9       62.3      62.9
 3 CU1U        37.9      46         43.6      44.4
 4 CU2D        42.1      51.6       45.6      44.3
 5 CU2M        38.4      48.3       44.2      45.5
 6 CU2U        39.8      50.7       45.7      46.4
 7 CU3D        46.5      49.5       47.7      47.7
 8 CU3M        67.7      71.2       69.5      69.4
 9 CU3U        43.3      52.6       48.1      48.2
10 CU4D        43.2      45.3       44.4      44.9
# ... with 50 more rows