Unpivoting data and filling null values to get current state from state changes in R

I have searched online for a solution to this problem, but could not find an answer for this specific case.

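I have a data frame with data for 19 different units, each of which has 2 departments. A department can be in one of several states (states 1-5) and must always be in exactly one state.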

The data itself looks like this:

Time        department      fromState   toState Date
46051.41923 unit36:depr2    4           5       2017-05-22 10:47
46077.33833 unit37:depr1    3           4       2017-05-22 10:47
47057.31889 unit31:depr2    2           3       2017-05-22 11:04
47062.31889 unit31:depr1    3           6       2017-05-22 11:04

Each row shows the state of a department before the state change (fromState) and after it (toState).

What I would rather have is:

Date               unit36:depr2   unit37:depr1   unit31:depr2   unit31:depr1
2017-05-22 10:47    5              4              2              3
2017-05-22 11:04    5              4              3              6

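This way I can see the state of every unit at any point in time. As you can probably see, I manually merged fromState and toState into columns named after the unit and department. I also removed duplicate date values. The Time column was only used to create the Date column, so it has been dropped as well.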

Is there a way to do this non-manually?

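I suggest looking at your data differently. Instead of "from" and "to", think of an "initial state" followed by changes recorded as they happen. Using your data (I added an extra "x" column header here so that read.table(text=...) can cope with the space inside the date-time; use your own data instead):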

x <- read.table(text='Time        department      fromState   toState Date x
46051.41923 unit36:depr2    4           5       2017-05-22 10:47
46077.33833 unit37:depr1    3           4       2017-05-22 10:47
47057.31889 unit31:depr2    2           3       2017-05-22 11:04
47062.31889 unit31:depr1    3           6       2017-05-22 11:04', header=TRUE, stringsAsFactors=FALSE)
x$Date <- as.POSIXct(paste(x$Date, x$x))
x$x <- NULL

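I will use two libraries for this, because I think they fit well here and are easy to read. I am sure someone can suggest data.table (probably faster) and base-R (no package dependencies) solutions.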

library(dplyr)
library(tidyr)

The first step is to determine the starting state of every department. (The 0 date simply stands for "before anything else happened".)

initial_state <- x %>%
  arrange(Date) %>%
  group_by(department) %>%
  summarize(
    date = as.POSIXct(0, origin='1970-01-01'),
    state = fromState[1]
  )
initial_state
# # A tibble: 4 × 3
#     department       date state
#          <chr>     <dttm> <int>
# 1 unit31:depr1 1970-01-01     3
# 2 unit31:depr2 1970-01-01     2
# 3 unit36:depr2 1970-01-01     4
# 4 unit37:depr1 1970-01-01     3

Then record each change at the moment it happens:

transitions <- select(x, department, date = Date, state = toState)
transitions
#     department                date state
# 1 unit36:depr2 2017-05-22 10:47:00     5
# 2 unit37:depr1 2017-05-22 10:47:00     4
# 3 unit31:depr2 2017-05-22 11:04:00     3
# 4 unit31:depr1 2017-05-22 11:04:00     6

The next step is to spread the data into wide format, one column per department:

bind_rows(initial_state, transitions) %>%
  spread(department, state)
# # A tibble: 3 × 5
#                  date `unit31:depr1` `unit31:depr2` `unit36:depr2` `unit37:depr1`
# *              <dttm>          <int>          <int>          <int>          <int>
# 1 1970-01-01 00:00:00              3              2              4              3
# 2 2017-05-22 10:47:00             NA             NA              5              4
# 3 2017-05-22 11:04:00              6              3             NA             NA

... realizing that NA means "nothing happened at this time for this department, so carry forward from the previous non-NA row". Fortunately, the zoo package has a function that does exactly this:

na.locf package:zoo R Documentation

Last Observation Carried Forward

Description:

Generic function for replacing each 'NA' with the most recent non-'NA' prior to it.
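
To make the carry-forward behavior concrete, here is a small check on plain vectors (a sketch that only assumes the zoo package is installed). One detail worth knowing: by default na.locf drops leading NAs, which changes the vector length; na.rm = FALSE keeps them.

```r
library(zoo)

# Each NA is replaced by the most recent non-NA value before it.
filled <- na.locf(c(3, NA, NA, 6, NA))
filled
# [1] 3 3 3 6 6

# Leading NAs have nothing to carry forward; na.rm = FALSE keeps them in place.
kept <- na.locf(c(NA, 1, NA), na.rm = FALSE)
kept
# [1] NA  1  1
```

In the pipeline for this question the first row is always the initial state, so no column starts with NA and the default is safe.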

library(zoo) # for clarity; not strictly required since I use '::' below
bind_rows(initial_state, transitions) %>%
  spread(department, state) %>%
  mutate_all(zoo::na.locf) %>%
  filter(date > 0) # since I no longer need the "0" date
# # A tibble: 2 × 5
#                  date `unit31:depr1` `unit31:depr2` `unit36:depr2` `unit37:depr1`
#                <dttm>          <int>          <int>          <int>          <int>
# 1 2017-05-22 10:47:00              3              2              5              4
# 2 2017-05-22 11:04:00              6              3              5              4
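
As a variation on the same idea, newer tidyr versions (1.0.0 and later) provide pivot_wider() as the successor to spread(), and fill() for carrying values forward, so zoo is not needed. The following is a sketch under that assumption; the sample data is rebuilt inline with tibble() so the block runs on its own, and the fixed "UTC" time zone is my choice, not something from the question.

```r
library(dplyr)
library(tidyr)

x <- tibble(
  department = c("unit36:depr2", "unit37:depr1", "unit31:depr2", "unit31:depr1"),
  fromState  = c(4L, 3L, 2L, 3L),
  toState    = c(5L, 4L, 3L, 6L),
  Date       = as.POSIXct(c("2017-05-22 10:47", "2017-05-22 10:47",
                            "2017-05-22 11:04", "2017-05-22 11:04"), tz = "UTC")
)

# Placeholder timestamp meaning "before anything else happened".
epoch <- as.POSIXct(0, origin = "1970-01-01", tz = "UTC")

# Initial state: the earliest fromState of each department.
initial_state <- x %>%
  arrange(Date) %>%
  group_by(department) %>%
  summarise(Date = epoch, state = first(fromState))

# Every change, recorded at the time it happens.
transitions <- select(x, department, Date, state = toState)

result <- bind_rows(initial_state, transitions) %>%
  pivot_wider(names_from = department, values_from = state) %>%
  arrange(Date) %>%                              # fill() depends on row order
  fill(everything(), .direction = "down") %>%    # carry last known state forward
  filter(Date > epoch)                           # drop the placeholder row
result
```

The explicit arrange(Date) before fill() matters: carry-forward is only meaningful when the rows are in time order.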

Here is another strategy using tidyverse functions. First, your data:

library(tidyverse)
dd <- read_csv("Time,department,fromState,toState,Date
46051.41923,unit36:depr2,4,5,2017-05-22 10:47
46077.33833,unit37:depr1,3,4,2017-05-22 10:47
47057.31889,unit31:depr2,2,3,2017-05-22 11:04
47062.31889,unit31:depr1,3,6,2017-05-22 11:04")

Now I get the starting state of each department (its first fromState):

start <- dd %>% 
  group_by(department) %>% 
  summarize(state=first(fromState)) %>% 
  spread(department, state)

Now, for each date, I collect the state changes that happened at that time:

changes <- dd %>% 
  arrange(Date) %>% 
  select(Date, department, toState) %>% 
  split(.$Date)  %>% 
  map(spread, department, toState)

Then I use accumulate to "replay" the changes for each date.

alt_list_modify <- function(x, y) list_modify(x, !!!y)
final <- accumulate(changes, alt_list_modify, .init = start) %>% 
  tail(-1) %>% bind_rows()

This returns the desired result:

# A tibble: 2 x 5
  `unit31:depr1` `unit31:depr2` `unit36:depr2` `unit37:depr1`                Date
           <int>          <int>          <int>          <int>              <dttm>
1              3              2              5              4 2017-05-22 10:47:00
2              6              3              5              4 2017-05-22 11:04:00
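
For completeness, the base-R route mentioned earlier can be sketched as a simple replay loop: keep a named vector with the current state of every department, apply the changes date by date, and store a snapshot after each date. This is only an illustration under the assumption that the data has the columns shown in the question; the variable names (current, states, result) are mine.

```r
# Sample data from the question, built directly so the block is self-contained.
x <- data.frame(
  department = c("unit36:depr2", "unit37:depr1", "unit31:depr2", "unit31:depr1"),
  fromState  = c(4L, 3L, 2L, 3L),
  toState    = c(5L, 4L, 3L, 6L),
  Date       = as.POSIXct(c("2017-05-22 10:47", "2017-05-22 10:47",
                            "2017-05-22 11:04", "2017-05-22 11:04"), tz = "UTC"),
  stringsAsFactors = FALSE
)
x <- x[order(x$Date), ]

depts <- sort(unique(x$department))
dates <- sort(unique(x$Date))

# Initial state: the first fromState seen for each department (named by department).
current <- vapply(depts, function(d) x$fromState[x$department == d][1], integer(1))

# Replay the changes date by date and record a snapshot after each date.
states <- matrix(NA_integer_, nrow = length(dates), ncol = length(depts),
                 dimnames = list(NULL, depts))
for (i in seq_along(dates)) {
  changed <- x[x$Date == dates[i], ]
  current[changed$department] <- changed$toState
  states[i, ] <- current
}

# check.names = FALSE keeps column names such as "unit31:depr1" unchanged.
result <- data.frame(Date = dates, states, check.names = FALSE)
result
```

If a department changes more than once at the same timestamp, the last change in row order wins, which matches the "current state" reading of the question.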