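Create empty missing rows for values in multi-group, multi-level data structures and calculate the difference between rows within groups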

Question

Suppose I have the following dataset:

ID  Type  Group      Week    Value
111 A      Pepper     -1      10
112 B      Salt        2      20
113 C      Curry       4      40
114 D      Rosemary    9      90
211 A      Pepper     -1      15
212 B      Salt        2      30
214 D      Rosemary    9      135

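Here ID, Type, Group and Week are entered into an instrument that measures "Value" once a week. Sometimes there are several results per week, so the initial tidying step was to average them into a single value per weekly measurement.

I would like to:

a) Create a dataset in which rows are inserted automatically wherever a week is missing from the Week column, so that it looks like the table below. The Type order is always A, B, C, D, the Group order is always Pepper, Salt, Curry, Rosemary, and the weeks are always -1, 2, 4, 9.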

ID  Type  Group      Week    Value
111 A      Pepper     -1      10
112 B      Salt        2      20
113 C      Curry       4      40
114 D      Rosemary    9      90
211 A      Pepper     -1      15
212 B      Salt        2      30
213 C      Curry       4      60
214 D      Rosemary    9      135

b) Then compute, within each group only, the difference between vertically adjacent measurements, i.e.:

ID  Type  Group      Week    Value  Diff
111 A      Pepper     -1      10     NA
112 B      Salt        2      20     10
113 C      Curry       4      40     20 
114 D      Rosemary    9      90     50
211 A      Pepper     -1      15     NA
212 B      Salt        2      30     15
213 C      Curry       4      60     30
214 D      Rosemary    9      135    75

I can see how to do this in a for loop, but surely there is a more elegant way?

Answer 1

I'm not sure whether this fully helps, but I think it could be a start.

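If your rows come in repeating groups, I would create a generic data frame for one group, repeat it as many times as needed, and then join your available dataset onto it. This effectively inserts the missing rows.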
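A related way to insert the missing rows, offered only as a sketch and not as the approach the code further down takes, is `tidyr::complete()`. It assumes that the hundreds digit of ID identifies the repeating block (1xx, 2xx); the names `obs`, `block` and `completed` are made up for illustration, and the rule used to rebuild the missing ID is likewise an assumption.

```r
library(dplyr)
library(tidyr)

# The question's data, with the row for ID 213 missing
obs <- data.frame(
  ID    = c(111, 112, 113, 114, 211, 212, 214),
  Type  = c("A", "B", "C", "D", "A", "B", "D"),
  Group = c("Pepper", "Salt", "Curry", "Rosemary", "Pepper", "Salt", "Rosemary"),
  Week  = c(-1, 2, 4, 9, -1, 2, 9),
  Value = c(10, 20, 40, 90, 15, 30, 135)
)

completed <- obs %>%
  # Assumption: the hundreds digit of ID marks the repeating block
  mutate(block = ID %/% 100) %>%
  # nesting() keeps only the Type/Group/Week combinations present in the data
  complete(block, nesting(Type, Group, Week)) %>%
  arrange(block, Week) %>%
  group_by(block) %>%
  # Assumption: IDs follow the pattern block * 100 + 10 + position in block
  mutate(ID = coalesce(ID, block * 100 + 10 + row_number())) %>%
  ungroup()

completed
```

The inserted row has `NA` in Value, since nothing was measured for it.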

Also, if you use the tidyverse, you can compute the difference with lag.
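As a minimal illustration of the lag idea on its own, the sketch below uses a hypothetical toy data frame (`toy`, with a made-up `block` column) rather than the question's data. Within each block, `lag()` shifts Value down by one row, so the first row of every block has no predecessor and gets `NA`.

```r
library(dplyr)

# Hypothetical toy data: two blocks of three measurements each
toy <- data.frame(
  block = c(1, 1, 1, 2, 2, 2),
  Value = c(10, 20, 40, 15, 30, 60)
)

toy_diff <- toy %>%
  group_by(block) %>%
  # Difference to the previous row within the same block
  mutate(Diff = Value - lag(Value)) %>%
  ungroup()

toy_diff
```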

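Note that this will not give exactly the result shown in the question, because I'm not sure where the 60 for Curry comes from (I will edit the answer later).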

library(tidyverse)

# Define number of repeating groups
N <- 2

# Create generic group of Type, Group, Week
df <- data.frame(
  Type = c("A", "B", "C", "D"),
  Group = c("Pepper", "Salt", "Curry", "Rosemary"),
  Week = c(-1, 2, 4, 9)
)

# Represents the number of rows
nrow_df <- nrow(df)

# Repeat groups of rows N times
full_df <- df[rep(seq_len(nrow_df), times = N), ]

# Add ID numbers
full_df$ID <- rep(seq(110, (100 * N) + 10, by = 100), each = nrow_df) + seq_len(nrow_df)

# Second data frame with missing rows
df2 <- read.table(text =
"ID  Type  Group      Week    Value
111 A      Pepper     -1      10
112 B      Salt        2      20
113 C      Curry       4      40
114 D      Rosemary    9      90
211 A      Pepper     -1      15
212 B      Salt        2      30
214 D      Rosemary    9      135", header = TRUE, stringsAsFactors = TRUE)

# Join the data frames and get differences
full_df %>%
  left_join(df2, by = c("Type", "Group", "Week", "ID")) %>%
  group_by(grp = ceiling(row_number()/nrow_df)) %>%
  mutate(Diff = Value - lag(Value))

# A tibble: 8 x 7
# Groups:   grp [2]
  Type  Group     Week    ID Value   grp  Diff
  <fct> <fct>    <dbl> <dbl> <int> <dbl> <int>
1 A     Pepper      -1   111    10     1    NA
2 B     Salt         2   112    20     1    10
3 C     Curry        4   113    40     1    20
4 D     Rosemary     9   114    90     1    50
5 A     Pepper      -1   211    15     2    NA
6 B     Salt         2   212    30     2    15
7 C     Curry        4   213    NA     2    NA
8 D     Rosemary     9   214   135     2    NA
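
In the output above, row 8 has `NA` in Diff because the inserted row 7 has no Value to subtract. If a difference against the last observed value in the group is acceptable, one possible variation is to carry the last seen value forward with `tidyr::fill()`. This is a sketch under that assumption, with the joined result rebuilt by hand and the names `joined`, `last_seen` and `filled` invented for illustration. It does not reproduce the 75 from the question, which would require a Value of 60 for the inserted row.

```r
library(dplyr)
library(tidyr)

# The joined result, rebuilt by hand so the sketch is self-contained
joined <- data.frame(
  ID    = c(111, 112, 113, 114, 211, 212, 213, 214),
  Week  = c(-1, 2, 4, 9, -1, 2, 4, 9),
  Value = c(10, 20, 40, 90, 15, 30, NA, 135),
  grp   = rep(1:2, each = 4)
)

filled <- joined %>%
  group_by(grp) %>%
  # Copy Value, then fill gaps downward with the last observed value per group
  mutate(last_seen = Value) %>%
  fill(last_seen, .direction = "down") %>%
  # Compare each Value with the last observed value before it
  mutate(Diff = Value - lag(last_seen)) %>%
  ungroup()

filled
```

With this variation the inserted row itself still has `NA` in Diff, while the row after it is compared with the last measured value in its group (here 135 - 30).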

r

data-mining