How to find duplicated rows when only the time part of a POSIXct column differs?
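I have some rows where the time part of a POSIXct column is missing (i.e. it equals 00:00:00). How can I find duplicate rows that differ only in the time?
If I use code like this: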
dataDuplicates <- data[duplicated(data, by = NULL) | duplicated(data, by = NULL, fromLast = TRUE), ]
then such cases are not found.
If I use the following code:
setkey(data, <all fields are there except that data field>, physical = TRUE)
dataDuplicates <- data[duplicated(data) | duplicated(data, fromLast = TRUE), ]
then rows are reported as duplicates even when their dates differ.
The test code is as follows:
zz <- "or,d,ddate,rdate,changes,class,price,fdate,company,number,minutes,added,source
VA3,VA4,2014-05-24 12:23:00,,0,0,2124,2014-05-22 15:50:16,,,,2014-05-22 12:20:03,ss
VA1,VA2,2014-05-26 14:00:01,,0,0,2124,2014-05-22 15:03:44,,,,2014-05-22 12:20:03,s1
VA1,VA2,2014-05-26 00:00:00,,0,0,2124,2014-05-22 15:03:44,,,,2014-05-22 12:20:03,s1
VA1,VA2,2014-05-27 14:00:01,,0,0,2124,2014-05-22 15:03:44,,,,2014-05-22 12:20:03,s1
VA5,VA6,2014-06-05 18:00:04,,0,0,2124,2014-05-22 15:48:24,,,,2014-05-22 12:20:03,s1
VA7,VA8,2014-06-09 18:00:07,,0,0,2124,2014-05-22 15:37:35,,,,2014-05-22 12:20:03,s2
VA9,VA0,2014-06-16 19:00:20,,0,0,2124,2014-05-22 14:17:33,,,,2014-05-22 12:20:03,ss"
columnClasses <- c("factor", "factor", "POSIXct", "factor", "integer", "factor", "integer", "factor", "factor", "factor", "integer", "factor", "factor")
data <- read.table(text=zz, header = TRUE, sep = ",", comment.char = "", quote = "", na.strings = c(""), colClasses = columnClasses)
Working code should return rows 2 and 3 as duplicates.
We can convert the 'ddate' column to 'Date' class and use that to find the duplicate rows.
d1 <- cbind(data[-3],as.Date(data$ddate))
data[duplicated(d1)|duplicated(d1, fromLast=TRUE),]
# or d ddate rdate changes class price fdate
#2 VA1 VA2 2014-05-26 14:00:01 <NA> 0 0 2124 2014-05-22 15:03:44
#3 VA1 VA2 2014-05-26 00:00:00 <NA> 0 0 2124 2014-05-22 15:03:44
# company number minutes added source
#2 <NA> <NA> NA 2014-05-22 12:20:03 s1
#3 <NA> <NA> NA 2014-05-22 12:20:03 s1
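Alternatively, compare only the date part of 'ddate' (note that this ignores all other columns):
data[duplicated(format(data$ddate, "%Y-%m-%d")) | duplicated(format(data$ddate, "%Y-%m-%d"), fromLast = TRUE), ]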
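Since the question uses data.table calls (setkey, duplicated(..., by = )), the same date-truncation idea can be sketched in data.table as well. This is only an illustration under some assumptions: the small table below is made up rather than the full test data, the helper column name `dday` is hypothetical, and the time zone is fixed to UTC so that `as.Date()` does not shift a timestamp to a neighbouring day.

```r
library(data.table)

# Small made-up table; only the 'ddate' and 'price' names mirror the question.
dt <- data.table(
  id    = c("VA3", "VA1", "VA1", "VA1"),
  ddate = as.POSIXct(c("2014-05-24 12:23:00", "2014-05-26 14:00:01",
                       "2014-05-26 00:00:00", "2014-05-27 14:00:01"),
                     tz = "UTC"),
  price = c(2124L, 2124L, 2124L, 2124L)
)

# Helper column holding only the calendar day (hypothetical name 'dday').
dt[, dday := as.Date(ddate, tz = "UTC")]

# Compare every column except the full timestamp, in both directions,
# so that all members of a duplicate group are flagged.
cols   <- setdiff(names(dt), "ddate")
dupIdx <- duplicated(dt, by = cols) | duplicated(dt, by = cols, fromLast = TRUE)

# Drop the helper column again before extracting the flagged rows.
dt[, dday := NULL]
dataDuplicates <- dt[dupIdx]
print(dataDuplicates)
```

Here the rows flagged are the two `VA1` rows dated 2014-05-26; the `VA1` row dated 2014-05-27 is left out because its day differs. Keeping the original `ddate` column intact means the returned rows still show the full timestamps.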