Why doesn't my date conversion solution work any more, despite nothing changing?
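A few months ago I wrote an R script, part of which converts character dates to Date format.
I originally ran into this problem when converting characters to Date format introduced NA values.
It was suggested that this happened because the parser must be expecting the day element of the date to be two characters, e.g. June 12th 2018, and that it only failed when the day element contained a single character, e.g. June 2nd 2018.
The solution provided (as.Date(df$date, format='%B %d %Y')) worked perfectly.
Until now.
Not only am I getting NA values, I am also receiving the error: Error: Duplicate identifiers for rows (12, 14), (13, 16).
I don't know what this means. Can someone explain?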
Here is the original data frame:
time.per.day Top.0.type Count
1 July 27th 2018, 00:00:00.000 conversation-archived 2
2 July 27th 2018, 00:00:00.000 conversation-archived 1
3 July 28th 2018, 00:00:00.000 conversation-archived 4
4 July 28th 2018, 00:00:00.000 conversation-archived 1
5 July 29th 2018, 00:00:00.000 conversation-archived 2
6 July 29th 2018, 00:00:00.000 conversation-archived 2
7 July 29th 2018, 00:00:00.000 conversation-auto-archived 2
8 July 30th 2018, 00:00:00.000 conversation-archived 3
9 July 30th 2018, 00:00:00.000 conversation-archived 2
10 July 30th 2018, 00:00:00.000 conversation-auto-archived 1
11 July 31st 2018, 00:00:00.000 conversation-archived 1
12 August 1st 2018, 00:00:00.000 conversation-archived 1
13 August 1st 2018, 00:00:00.000 conversation-auto-archived 1
14 August 2nd 2018, 00:00:00.000 conversation-archived 4
15 August 2nd 2018, 00:00:00.000 conversation-archived 1
16 August 2nd 2018, 00:00:00.000 conversation-auto-archived 2
Here is the original data:
df <- structure(list(time.per.day = c("July 27th 2018, 00:00:00.000",
"July 27th 2018, 00:00:00.000", "July 28th 2018, 00:00:00.000",
"July 28th 2018, 00:00:00.000", "July 29th 2018, 00:00:00.000",
"July 29th 2018, 00:00:00.000", "July 29th 2018, 00:00:00.000",
"July 30th 2018, 00:00:00.000", "July 30th 2018, 00:00:00.000",
"July 30th 2018, 00:00:00.000", "July 31st 2018, 00:00:00.000",
"August 1st 2018, 00:00:00.000", "August 1st 2018, 00:00:00.000",
"August 2nd 2018, 00:00:00.000", "August 2nd 2018, 00:00:00.000",
"August 2nd 2018, 00:00:00.000"), Top.0.type = c("conversation-archived",
"conversation-archived", "conversation-archived", "conversation-archived",
"conversation-archived", "conversation-archived", "conversation-auto-archived",
"conversation-archived", "conversation-archived", "conversation-auto-archived",
"conversation-archived", "conversation-archived", "conversation-auto-archived",
"conversation-archived", "conversation-archived", "conversation-auto-archived"
), Count = c(2L, 1L, 4L, 1L, 2L, 2L, 2L, 3L, 2L, 1L, 1L, 1L,
1L, 4L, 1L, 2L)), class = "data.frame", row.names = c(NA, -16L
))
I rename the columns (colnames(df) <- c("date", "type", "retailer_code", "count")) and manipulate the data to look a certain way, but the problem now appears when I use as.Date(df$date, format='%B %d %Y') after some other housekeeping:
# Remove time and identifiers from date column
df$date <- gsub(", 00:00:00.000", "", df$date)
df$date <- gsub("st", "", df$date)
df$date <- gsub("nd", "", df$date)
df$date <- gsub("rd", "", df$date)
df$date <- gsub("th", "", df$date)
Here is the resulting data frame:
date type count
1 2018-07-27 Completed 2
2 2018-07-27 Completed 1
3 2018-07-28 Completed 4
4 2018-07-28 Completed 1
5 2018-07-29 Completed 2
6 2018-07-29 Completed 2
7 2018-07-29 Missed 2
8 2018-07-30 Completed 3
9 2018-07-30 Completed 2
10 2018-07-30 Missed 1
11 2018-07-31 Completed 1
12 <NA> Completed 1
13 <NA> Missed 1
14 <NA> Completed 4
15 <NA> Completed 1
16 <NA> Missed 2
Here is the dput of the resulting data frame:
df <- structure(list(date = structure(c(17739, 17739, 17740, 17740,
17741, 17741, 17741, 17742, 17742, 17742, 17743, NA, NA, NA,
NA, NA), class = "Date"), type = c("Completed", "Completed",
"Completed", "Completed", "Completed", "Completed", "Missed",
"Completed", "Completed", "Missed", "Completed", "Completed",
"Missed", "Completed", "Completed", "Missed"), count = c(2L,
1L, 4L, 1L, 2L, 2L, 2L, 3L, 2L, 1L, 1L, 1L, 1L, 4L, 1L, 2L)), class = "data.frame", row.names = c(NA,
-16L))
Why is it going wrong now?
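I noticed that df$date <- gsub("st", "", df$date) is converting August to Augu, which is what produces the NA values.
I changed it to df$date <- gsub("1st", "", df$date), but this now causes other problems in the resulting data frame (rows 12-16 inclusive):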
date type count
1 2018-07-27 Completed 2
2 2018-07-27 Completed 1
3 2018-07-28 Completed 4
4 2018-07-28 Completed 1
5 2018-07-29 Completed 2
6 2018-07-29 Completed 2
7 2018-07-29 Missed 2
8 2018-07-30 Completed 3
9 2018-07-30 Completed 2
10 2018-07-30 Missed 1
11 2018-07-03 Completed 1
12 0018-08-20 Completed 1
13 0018-08-20 Missed 1
14 0018-08-20 Completed 4
15 0018-08-20 Completed 1
16 0018-08-20 Missed 2
How can I fix this?
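It turns out that
df$date <- gsub("st", "", df$date)
caused the problem, because it matches the "st" in "August" as well as the "st" in "1st". To get around this, we only need to replace "1st" with "1", since the day number has to be kept.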
df$date <- gsub("1st", "1", df$date)
Then convert to Date.
as.Date(df$date, "%B %d %Y")
#[1] "2018-07-27" "2018-07-27" "2018-07-28" "2018-07-28" "2018-07-29" "2018-07-29"
#[7] "2018-07-29" "2018-07-30" "2018-07-30" "2018-07-30" "2018-07-31" "2018-08-01"
#[13] "2018-08-01" "2018-08-02" "2018-08-02" "2018-08-02"
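To make the failure mode concrete, here is a minimal sketch on two strings taken from the data above (with the time part already removed). The variable names x, broken, emptied and fixed are only for illustration, and the final parsing step assumes an English locale, since %B matches month names in the current locale.

```r
# Two sample values from the question, time part already stripped
x <- c("July 31st 2018", "August 1st 2018")

# Original approach: removes every "st", including the one inside "August"
broken <- gsub("st", "", x)
print(broken)   # "July 31 2018" "Augu 1 2018"

# The asker's second attempt: removes the digit 1 together with the suffix
emptied <- gsub("1st", "", x)
print(emptied)  # "July 3 2018" "August  2018"

# Keeping the digit: only the suffix after the 1 is dropped
fixed <- gsub("1st", "1", x)
print(fixed)    # "July 31 2018" "August 1 2018"

# Parsing depends on the locale; English month names are assumed here
print(as.Date(fixed, format = "%B %d %Y"))
```

The second variant explains the odd dates in the question: once the day number is gone, %d and %Y consume digits from what was meant to be the year.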
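Ideally, hard-coding the values to replace is not a good idea, since it leads to exactly this kind of problem. Instead, we can strip the ordinal suffix only when it follows a number, in one step rather than with 4 separate subs.
So after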
df$date <- sub(", 00:00:00.000", "", df$date, fixed = TRUE)
we can directly do,
df$date <- sub("(\\d+)(st|nd|rd|th)\\b", "\\1", df$date)
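The one-step replacement can be checked on a few sample strings. This is only a sketch: the vector x is hypothetical, and its last element ("August 3rd ...") is not in the original data but is added to exercise the "rd" suffix. Note that inside an R string literal each backslash of the regular expression has to be doubled. The final conversion again assumes an English locale.

```r
# Sample inputs in the same shape as the time.per.day column
x <- c("July 27th 2018, 00:00:00.000",
       "July 31st 2018, 00:00:00.000",
       "August 1st 2018, 00:00:00.000",
       "August 2nd 2018, 00:00:00.000",
       "August 3rd 2018, 00:00:00.000")  # illustrative, not in the source data

# Drop the time part; fixed = TRUE treats the pattern as a literal string
x <- sub(", 00:00:00.000", "", x, fixed = TRUE)

# Remove an ordinal suffix only when it directly follows one or more digits;
# "\\1" puts the captured digits back, so "August" is left untouched
cleaned <- sub("(\\d+)(st|nd|rd|th)\\b", "\\1", x)
print(cleaned)

# Month-name parsing is locale dependent; an English locale is assumed
dates <- as.Date(cleaned, format = "%B %d %Y")
print(dates)
```

Because the suffix must follow a digit, the pattern does not need to know which month names happen to contain "st", "nd", "rd" or "th".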