How can I check if a sequence of events is in order?
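I have a data table in which each column represents an event: if the event occurred, the cell holds a date; if it did not, the cell is empty. All events are optional, but if they occur they must follow a set order (A, then B, then C, ...).
While exploring the data I found at least a couple of data quality problems: for example, event A is empty while event B has a date, or event A's date is later than event B's. I have to check 10 columns across more than 1,000 rows, so I'm wondering whether there is a way to automate this in R (I only need to flag whether the sequence is correct, and then I'll check the bad cases manually). The only thing I can think of is a lot of nested ifelse statements, which doesn't seem appropriate at all.
Does anyone know a better function or approach? Thanks in advance. Here is some dummy data (consecutive events may have the same date):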
> dput(Book1)
structure(list(ID = 1:20, A = structure(c(17532, NA, NA, 17226,
17498, 17204, 17646, 17567, 17609, 17259, 17606, 17606, 17567,
17612, 17612, 17612, 17395, 17687, 17612, 17687), class = "Date"),
B = structure(c(17567, 17716, NA, 17259, 17562, NA, 17651,
17606, 17612, 17226, NA, 17681, NA, NA, NA, NA, 17407, 17687,
NA, 17716), class = "Date"), C = structure(c(NA, NA, NA,
17260, NA, NA, NA, NA, 17614, NA, NA, 17687, NA, 17687, NA,
NA, NA, NA, NA, 17716), class = "Date"), D = structure(c(NA,
NA, NA, 17407, NA, NA, NA, NA, 17625, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA), class = "Date"), E = structure(c(NA,
NA, NA, 17606, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA), class = "Date")), .Names = c("ID", "A",
"B", "C", "D", "E"), row.names = c(NA, -20L), spec = structure(list(
cols = structure(list(ID = structure(list(), class = c("collector_integer",
"collector")), A = structure(list(), class = c("collector_character",
"collector")), B = structure(list(), class = c("collector_character",
"collector")), C = structure(list(), class = c("collector_character",
"collector")), D = structure(list(), class = c("collector_character",
"collector")), E = structure(list(), class = c("collector_character",
"collector"))), .Names = c("ID", "A", "B", "C", "D", "E")),
default = structure(list(), class = c("collector_guess",
"collector"))), .Names = c("cols", "default"), class = "col_spec"), class =
c("tbl_df",
"tbl", "data.frame"))
So, in this example, rows 2, 10 and 14 should be flagged.
Thanks in advance.
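You can use apply() to check each row in turn and, within it, sapply() to check each element of the row.
Assuming your data frame is called test_data, we'll add a new column that shows whether the date columns in each row make sense according to the rules you specified.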
test_data$valid <- apply(test_data[2:ncol(test_data)], 1, function (x) {
  # sapply iterates over each element in the row after the first one, checking
  # all the previous elements
  valid <- sapply(2:length(x), function (y) {
    # seq_len(y - 1) avoids the precedence trap in 1:y-1, which is (1:y) - 1
    previous <- x[seq_len(y - 1)]
    ifelse(
      !is.na(x[y]) # we can only check an element if it is a date
      & (
        # if any of the elements before the current one are NA, this is a
        # problem
        any(is.na(previous)) |
        # if any of the dates before the current one are greater than the
        # current one, this is also a problem
        max(previous) > x[y]
      ),
      FALSE, TRUE)
  })
  # if any of the elements in `valid` are false, this says there is a problem in
  # the data (note `valid` is shorter than `x` by one element because the first
  # element isn't checked against itself)
  all(valid)
})
# show only the rows that failed the check
test_data[!test_data$valid, ]
I would do this in data.table, but I'm sure a dplyr version would be similar:
library(data.table)
setDT(DF) # <- convert to data.table
DF[DF[ , melt(.SD, id.vars = 'ID')
       ][ , {
         non_na_idx = which(!is.na(value))
         any(diff(value) < 0, na.rm = TRUE) ||
           (length(non_na_idx) &&
              max(non_na_idx) != length(non_na_idx))
       }, keyby = ID],
   flag := i.V1, on = 'ID'][]
# ID A B C D E flag
# 1: 1 2018-01-01 2018-02-05 <NA> <NA> <NA> FALSE
# 2: 2 <NA> 2018-07-04 <NA> <NA> <NA> TRUE
# 3: 3 <NA> <NA> <NA> <NA> <NA> FALSE
# 4: 4 2017-03-01 2017-04-03 2017-04-04 2017-08-29 2018-03-16 FALSE
# 5: 5 2017-11-28 2018-01-31 <NA> <NA> <NA> FALSE
# 6: 6 2017-02-07 <NA> <NA> <NA> <NA> FALSE
# 7: 7 2018-04-25 2018-04-30 <NA> <NA> <NA> FALSE
# 8: 8 2018-02-05 2018-03-16 <NA> <NA> <NA> FALSE
# 9: 9 2018-03-19 2018-03-22 2018-03-24 2018-04-04 <NA> FALSE
# 10: 10 2017-04-03 2017-03-01 <NA> <NA> <NA> TRUE
# 11: 11 2018-03-16 <NA> <NA> <NA> <NA> FALSE
# 12: 12 2018-03-16 2018-05-30 2018-06-05 <NA> <NA> FALSE
# 13: 13 2018-02-05 <NA> <NA> <NA> <NA> FALSE
# 14: 14 2018-03-22 <NA> 2018-06-05 <NA> <NA> TRUE
# 15: 15 2018-03-22 <NA> <NA> <NA> <NA> FALSE
# 16: 16 2018-03-22 <NA> <NA> <NA> <NA> FALSE
# 17: 17 2017-08-17 2017-08-29 <NA> <NA> <NA> FALSE
# 18: 18 2018-06-05 2018-06-05 <NA> <NA> <NA> FALSE
# 19: 19 2018-03-22 <NA> <NA> <NA> <NA> FALSE
# 20: 20 2018-06-05 2018-07-04 2018-07-04 <NA> <NA> FALSE
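apply-style answers will force coercion of your table to a matrix, which can come with some unexpected side effects (and can be slow for larger examples), so I chose to reshape the data to long format; I think the problem is much simpler to solve in long form.
The reshaping is done with melt: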
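To make the coercion concrete, here is a minimal base R sketch. The two-row data frame df is made up for illustration (its dates are borrowed from the dummy data); it shows that the rows apply() hands to the function are character vectors, not Date objects.

```r
# Hypothetical two-row table with Date columns, for illustration only.
df <- data.frame(
  ID = 1:2,
  A = as.Date(c("2018-01-01", NA)),
  B = as.Date(c("2018-02-05", "2018-07-04"))
)

# apply() converts a data frame with as.matrix() before iterating over rows.
m <- as.matrix(df[2:ncol(df)])

class(df$A)  # "Date"
typeof(m)    # "character"

# Each row reaches the function as a character vector.
row_classes <- apply(df[2:ncol(df)], 1, function(x) class(x))
row_classes
```

Comparisons on such rows are string comparisons. They happen to sort correctly for dates printed as YYYY-MM-DD, but date arithmetic such as diff() no longer works on them.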
DF[ , melt(.SD, id.vars = 'ID')]
# ID variable value
# 1: 1 A 2018-01-01
# 2: 2 A <NA>
# 3: 3 A <NA>
# 4: 4 A 2017-03-01
# 5: 5 A 2017-11-28
# 6: 6 A 2017-02-07
# 7: 7 A 2018-04-25
# 8: 8 A 2018-02-05
# 9: 9 A 2018-03-19
# 10: 10 A 2017-04-03
# < more rows here >
# 91: 11 E <NA>
# 92: 12 E <NA>
# 93: 13 E <NA>
# 94: 14 E <NA>
# 95: 15 E <NA>
# 96: 16 E <NA>
# 97: 17 E <NA>
# 98: 18 E <NA>
# 99: 19 E <NA>
# 100: 20 E <NA>
# ID variable value
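You have two conditions to look for.
First, within any row, a date in a later column (in alphabetical order) should not come before a date in an earlier column. In the long form of the data, this means that the values for each ID should be monotonically non-decreasing or, equivalently, that diff(value) is always non-negative. So our flag is TRUE if any(diff(value) < 0, na.rm = TRUE), which means that at least one such difference is negative for this ID: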
DF[ , melt(.SD, id.vars = 'ID')
    ][ , any(diff(na.omit(value)) < 0, na.rm = TRUE),
       keyby = ID]
# ID V1
# 1: 1 FALSE
# < omitted; all FALSE >
# 9: 9 FALSE
# 10: 10 TRUE # <- column B comes before column A
# 11: 11 FALSE
# < omitted; all FALSE >
# 20: 20 FALSE
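The same check can be tried on plain Date vectors outside data.table. This is only a sketch: the helper name has_decrease is made up for illustration, and the two vectors reproduce rows 10 and 18 of the dummy data.

```r
# Hypothetical helper: TRUE if any non-missing date is earlier than the
# non-missing date before it.
has_decrease <- function(value) {
  any(diff(na.omit(value)) < 0)
}

row_10 <- as.Date(c("2017-04-03", "2017-03-01", NA, NA, NA))  # B before A
row_18 <- as.Date(c("2018-06-05", "2018-06-05", NA, NA, NA))  # A and B equal

has_decrease(row_10)  # TRUE
has_decrease(row_18)  # FALSE
```

Equal dates give a difference of zero, which is not negative, so events that share a date are not flagged.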
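Second, once a column "goes missing", it should "stay missing", meaning there should be no NA gaps between observed values. A row with no dates at all is fine; otherwise, the number of non-missing elements must equal the column number of the last non-missing column. So the flag is TRUE when (a) there is at least one non-missing value in the row and (b) those two numbers differ: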
DF[ , melt(.SD, id.vars = 'ID')
    ][ , {
      non_na_idx = which(!is.na(value))
      length(non_na_idx) && max(non_na_idx) != length(non_na_idx)
    }, keyby = ID]
# ID V1
# 1: 1 FALSE
# 2: 2 TRUE # <- Column A missing, B not
# 3: 3 FALSE
# < omitted; all FALSE >
# 13: 13 FALSE
# 14: 14 TRUE # <- Column B missing, C not
# 15: 15 FALSE
# < omitted; all FALSE >
# 20: 20 FALSE
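As a quick sanity check of the gap rule, here is a sketch on plain vectors. The helper name has_gap is hypothetical; the vectors reproduce rows 2, 3, 6 and 14 of the dummy data.

```r
# Hypothetical helper: TRUE if a missing value sits before a non-missing one.
has_gap <- function(value) {
  non_na_idx <- which(!is.na(value))
  length(non_na_idx) > 0 && max(non_na_idx) != length(non_na_idx)
}

row_2  <- as.Date(c(NA, "2018-07-04", NA, NA, NA))            # A missing, B not
row_3  <- as.Date(rep(NA_character_, 5))                      # nothing happened
row_6  <- as.Date(c("2017-02-07", NA, NA, NA, NA))            # only A happened
row_14 <- as.Date(c("2018-03-22", NA, "2018-06-05", NA, NA))  # B missing, C not

has_gap(row_2)   # TRUE
has_gap(row_3)   # FALSE
has_gap(row_6)   # FALSE
has_gap(row_14)  # TRUE
```

The length check comes first so that an all-missing row never reaches max(), which would warn on an empty vector.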
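Combining the two conditions flags all three rows.
Finally, we join the newly created flags back to the original table, creating a column named flag. This can be split into two steps: build a table with the flag column, then join: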
DF_with_flag =
  DF[ , melt(.SD, id.vars = 'ID')
      ][ , {
        non_na_idx = which(!is.na(value))
        any(diff(na.omit(value)) < 0, na.rm = TRUE) ||
          (length(non_na_idx) &&
             max(non_na_idx) != length(non_na_idx))
      }, keyby = ID]
DF[DF_with_flag, flag := i.V1, on = 'ID']
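For readers who cannot use data.table, the two conditions can also be combined in base R without reshaping. This is a rough sketch, not the method above: DF_small is a hand-built subset of the dummy data (IDs 1, 2, 3, 10 and 14, columns A to C only), and is_bad_sequence is a hypothetical helper name.

```r
# A few rows of the dummy data, rebuilt by hand for illustration.
DF_small <- data.frame(
  ID = c(1L, 2L, 3L, 10L, 14L),
  A = as.Date(c("2018-01-01", NA, NA, "2017-04-03", "2018-03-22")),
  B = as.Date(c("2018-02-05", "2018-07-04", NA, "2017-03-01", NA)),
  C = as.Date(c(NA, NA, NA, NA, "2018-06-05"))
)

# Hypothetical helper combining both conditions for one row of dates.
is_bad_sequence <- function(value) {
  non_na_idx <- which(!is.na(value))
  any(diff(value[non_na_idx]) < 0) ||
    (length(non_na_idx) > 0 && max(non_na_idx) != length(non_na_idx))
}

# Convert the date columns to a numeric matrix (days since 1970-01-01) so
# that row-wise access does not turn the dates into character strings.
date_cols <- setdiff(names(DF_small), "ID")
date_matrix <- sapply(DF_small[date_cols], as.numeric)

DF_small$flag <- apply(date_matrix, 1, is_bad_sequence)
DF_small[DF_small$flag, "ID"]  # 2 10 14
```

On this subset the flagged IDs are 2, 10 and 14, matching the rows the question expects to be flagged.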