Why doesn't my date conversion solution work any more, despite nothing changing?

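A few months ago I wrote an R script, part of which converts character dates to Date format.

I originally ran into this problem when NAs were introduced while converting characters to Date format.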

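It was suggested that this happened because the day element of the date was expected to be two characters, e.g. June 12th 2018, so the NAs appeared only when the day element contained a single character, e.g. June 2nd 2018.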

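The solution provided (as.Date(df$date, format='%B %d %Y')) worked perfectly.

Until now.

Not only am I getting NA values, I am also receiving an error: Error: Duplicate identifiers for rows (12, 14), (13, 16).

I have no idea what this means - can someone explain?

Here is the original data frame: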

                    time.per.day                 Top.0.type Count
1   July 27th 2018, 00:00:00.000      conversation-archived     2
2   July 27th 2018, 00:00:00.000      conversation-archived     1
3   July 28th 2018, 00:00:00.000      conversation-archived     4
4   July 28th 2018, 00:00:00.000      conversation-archived     1
5   July 29th 2018, 00:00:00.000      conversation-archived     2
6   July 29th 2018, 00:00:00.000      conversation-archived     2
7   July 29th 2018, 00:00:00.000 conversation-auto-archived     2
8   July 30th 2018, 00:00:00.000      conversation-archived     3
9   July 30th 2018, 00:00:00.000      conversation-archived     2
10  July 30th 2018, 00:00:00.000 conversation-auto-archived     1
11  July 31st 2018, 00:00:00.000      conversation-archived     1
12 August 1st 2018, 00:00:00.000      conversation-archived     1
13 August 1st 2018, 00:00:00.000 conversation-auto-archived     1
14 August 2nd 2018, 00:00:00.000      conversation-archived     4
15 August 2nd 2018, 00:00:00.000      conversation-archived     1
16 August 2nd 2018, 00:00:00.000 conversation-auto-archived     2

Here is the dput of the original data:

df <- structure(list(time.per.day = c("July 27th 2018, 00:00:00.000", 
"July 27th 2018, 00:00:00.000", "July 28th 2018, 00:00:00.000", 
"July 28th 2018, 00:00:00.000", "July 29th 2018, 00:00:00.000", 
"July 29th 2018, 00:00:00.000", "July 29th 2018, 00:00:00.000", 
"July 30th 2018, 00:00:00.000", "July 30th 2018, 00:00:00.000", 
"July 30th 2018, 00:00:00.000", "July 31st 2018, 00:00:00.000", 
"August 1st 2018, 00:00:00.000", "August 1st 2018, 00:00:00.000", 
"August 2nd 2018, 00:00:00.000", "August 2nd 2018, 00:00:00.000", 
"August 2nd 2018, 00:00:00.000"), Top.0.type = c("conversation-archived", 
"conversation-archived", "conversation-archived", "conversation-archived", 
"conversation-archived", "conversation-archived", "conversation-auto-archived", 
"conversation-archived", "conversation-archived", "conversation-auto-archived", 
"conversation-archived", "conversation-archived", "conversation-auto-archived", 
"conversation-archived", "conversation-archived", "conversation-auto-archived"
), Count = c(2L, 1L, 4L, 1L, 2L, 2L, 2L, 3L, 2L, 1L, 1L, 1L, 
1L, 4L, 1L, 2L)), class = "data.frame", row.names = c(NA, -16L
))

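I rename the columns (colnames(df) <- c("date", "type", "retailer_code", "count")) and manipulate the data so that it looks a certain way. The problem now appears when I call as.Date(df$date, format='%B %d %Y') after carrying out some other housekeeping: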

 # Remove time and identifiers from date column
df$date <- gsub(", 00:00:00.000", "", df$date)
df$date <- gsub("st", "", df$date)
df$date <- gsub("nd", "", df$date)
df$date <- gsub("rd", "", df$date)
df$date <- gsub("th", "", df$date)

Here is the resulting data frame:

         date      type count
1  2018-07-27 Completed     2
2  2018-07-27 Completed     1
3  2018-07-28 Completed     4
4  2018-07-28 Completed     1
5  2018-07-29 Completed     2
6  2018-07-29 Completed     2
7  2018-07-29    Missed     2
8  2018-07-30 Completed     3
9  2018-07-30 Completed     2
10 2018-07-30    Missed     1
11 2018-07-31 Completed     1
12       <NA> Completed     1
13       <NA>    Missed     1
14       <NA> Completed     4
15       <NA> Completed     1
16       <NA>    Missed     2

Here is the dput of the resulting data frame:

df <- structure(list(date = structure(c(17739, 17739, 17740, 17740, 
17741, 17741, 17741, 17742, 17742, 17742, 17743, NA, NA, NA, 
NA, NA), class = "Date"), type = c("Completed", "Completed", 
"Completed", "Completed", "Completed", "Completed", "Missed", 
"Completed", "Completed", "Missed", "Completed", "Completed", 
"Missed", "Completed", "Completed", "Missed"), count = c(2L, 
1L, 4L, 1L, 2L, 2L, 2L, 3L, 2L, 1L, 1L, 1L, 1L, 4L, 1L, 2L)), class = "data.frame", row.names = c(NA, 
-16L))

Why is it going wrong now?


I have noticed that df$date <- gsub("st", "", df$date) turns August into Augu, which is what produces the NA values.
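A quick check on a single value, as a minimal sketch, shows the effect. The variable names here are only for illustration and are not part of the original script.

```r
# One value from the data, with the time part already removed
x <- "August 1st 2018"

# gsub() removes every "st", including the one inside "August"
stripped <- gsub("st", "", x)
stripped  # "Augu 1 2018"

# The month name no longer parses, so the conversion yields NA
parsed <- as.Date(stripped, format = "%B %d %Y")
is.na(parsed)
```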

I changed it to df$date <- gsub("1st", "", df$date), but this now causes other problems in the resulting data frame (rows 12-16 inclusive):

         date      type count
1  2018-07-27 Completed     2
2  2018-07-27 Completed     1
3  2018-07-28 Completed     4
4  2018-07-28 Completed     1
5  2018-07-29 Completed     2
6  2018-07-29 Completed     2
7  2018-07-29    Missed     2
8  2018-07-30 Completed     3
9  2018-07-30 Completed     2
10 2018-07-30    Missed     1
11 2018-07-03 Completed     1
12 0018-08-20 Completed     1
13 0018-08-20    Missed     1
14 0018-08-20 Completed     4
15 0018-08-20 Completed     1
16 0018-08-20    Missed     2

How can I fix this?

It turns out that

df$date <- gsub("st", "", df$date)

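was causing the problem, because it matches the "st" in "August" as well as the one in "1st". To get around this, we just need to replace "1st" with "1" rather than with nothing, since we still need the day number.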

df$date <- gsub("1st", "1", df$date)

Then convert to Date.

as.Date(df$date, "%B %d %Y")

#[1]  "2018-07-27" "2018-07-27" "2018-07-28" "2018-07-28" "2018-07-29" "2018-07-29"
#[7]  "2018-07-29" "2018-07-30" "2018-07-30" "2018-07-30" "2018-07-31" "2018-08-01"
#[13] "2018-08-01" "2018-08-02" "2018-08-02" "2018-08-02"
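For comparison, here is a small sketch of why the attempt in the question, which replaced "1st" with an empty string, went wrong. It uses two values from the data; the variable names are illustrative only.

```r
# Two values from the data, with the time part already removed
x <- c("July 31st 2018", "August 1st 2018")

# Replacing "1st" with nothing deletes part or all of the day number
emptied <- gsub("1st", "", x)
emptied   # "July 3 2018"  "August  2018"

# Replacing "1st" with "1" keeps the day number intact
replaced <- gsub("1st", "1", x)
replaced  # "July 31 2018" "August 1 2018"
```

With the day missing from "August  2018", it seems that %d takes the first two digits of the year as the day and %Y is left with 18, which would explain the 0018-08-20 values in the question. Likewise, "July 3 2018" accounts for the 2018-07-03 in row 11.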

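Ideally, hardcoding and replacing values like this is not a good idea, because it leads to exactly this kind of problem. Instead, we can remove the ordinal suffix wherever it follows a number, in one step rather than with four separate substitutions.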

So, after

df$date <- sub(", 00:00:00.000", "", df$date)

we can directly do

df$date <- sub("(\\d+)(st|nd|rd|th)\\b", "\\1", df$date)
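As a minimal end-to-end sketch, the two steps can be tried on a few values from the question. The "August 3rd" value is a hypothetical addition to cover the "rd" suffix, and the variable names are illustrative. Because %B is locale-dependent, the final conversion assumes an English LC_TIME locale.

```r
raw <- c("July 27th 2018, 00:00:00.000",
         "July 31st 2018, 00:00:00.000",
         "August 1st 2018, 00:00:00.000",
         "August 2nd 2018, 00:00:00.000",
         "August 3rd 2018, 00:00:00.000")  # hypothetical value, not in the data

# Drop the time part; fixed = TRUE treats the pattern as a literal string
no_time <- sub(", 00:00:00.000", "", raw, fixed = TRUE)

# Remove the ordinal suffix only when it directly follows digits,
# so the "st" inside "August" is left alone
cleaned <- sub("(\\d+)(st|nd|rd|th)\\b", "\\1", no_time)
cleaned

# Assumes an English locale for month names
dates <- as.Date(cleaned, format = "%B %d %Y")
dates
```

Note that in R string literals the backslashes in the pattern and in the backreference have to be doubled.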