Repeat rows in a data frame AND add an increment field
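I have found plenty of answers on how to repeat records, but I also want to add an increment field to each repeated record. I found a similar question, but it has no startValue field: Repeat the rows in a data frame based on values in a specific column.
My data frame starts as: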
df <-
data startValue freq
a 3.4 3
b 2.1 2
c 6.3 1
I want this output:
df.expanded <-
data startValue value
a 3.4 3
a 3.4 4
a 3.4 5
b 2.1 2
b 2.1 3
c 6.3 6
I did find a way to do this, but I would like something simpler that copes better with large datasets. Here is what I did.
df <- data.frame(data = c("a", "b", "c"),
                 startValue = c(3.4, 2.1, 6.3),
                 freq = c(3, 2, 1))
df
# find the largest integer that I will need as an index.
n <- floor(max(df$startValue + df$freq))-1
# repeat each df record n times. Only the record with the
# largest startValue + freq needs to be repeated this many
# times, but I am repeating everything this many times.
df.expanded <- df[rep(row.names(df), each = n), ]
# Use recycling to fill a new column. Now I have created
# a Cartesian product. If n is 46, records with a
# freq of 46 are repeated just the right number of times.
# but records with a freq of 2 are repeated many more times
# than is needed.
df.expanded$value <- 1:n
# finally, I filter out all the extra repeats that I didn't need.
df.expanded <-
  df.expanded[df.expanded$value >= floor(df.expanded$startValue)
              & df.expanded$value < floor(df.expanded$startValue + df.expanded$freq), ]
# drop the third column (freq) when printing
df.expanded[-3]
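Is there a way to do this that handles large datasets better? Most records need fewer than 5 repeats, but a few need 50. I don't like the idea of repeating everything 50 times when only 1 record in 10,000 needs that many repeats. Thanks.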
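To put a number on that overhead, here is a quick sketch on the same example df. The names rows_built and rows_needed are made up for illustration; it only compares how many rows the approach above builds with how many it keeps.

```r
df <- data.frame(data = c("a", "b", "c"),
                 startValue = c(3.4, 2.1, 6.3),
                 freq = c(3, 2, 1))

# same n as in the approach above
n <- floor(max(df$startValue + df$freq)) - 1

rows_built  <- nrow(df) * n   # every record is repeated n times
rows_needed <- sum(df$freq)   # one row per requested repeat

# for this example: 18 rows built, 6 rows needed
c(built = rows_built, needed = rows_needed)
```

The gap grows with the number of records, since a single record with a large startValue + freq sets n for all of them.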
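You can use uncount from tidyr:
library(dplyr)
library(tidyr)
df %>%
  # repeat each row freq times; n counts 1..freq within each original row
  uncount(weights = freq, .id = "n", .remove = FALSE) %>%
  # start counting at floor(startValue) so the result matches the requested output
  mutate(value = floor(startValue) + n - 1)
  data startValue freq n value
1    a        3.4    3 1     3
2    a        3.4    3 2     4
3    a        3.4    3 3     5
4    b        2.1    2 1     2
5    b        2.1    2 2     3
6    c        6.3    1 1     6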
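I'm not sure why you would want that, but using the tidyverse:
library(tidyverse)
df %>%
  # one sequence per row: start at floor(startValue), take freq consecutive values
  mutate(value = map2(startValue, freq, ~ seq(floor(.x), length.out = .y))) %>%
  unnest(value) %>%
  select(-freq)
#   data startValue value
# 1    a        3.4     3
# 2    a        3.4     4
# 3    a        3.4     5
# 4    b        2.1     2
# 5    b        2.1     3
# 6    c        6.3     6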
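For comparison, a base R sketch of the same idea, assuming value should start at floor(startValue) as in the requested output. It is not taken from the answers above; idx is just an illustrative name. It builds only the rows that are needed by repeating row indices with rep() and generating the per-record counter with sequence().

```r
df <- data.frame(data = c("a", "b", "c"),
                 startValue = c(3.4, 2.1, 6.3),
                 freq = c(3, 2, 1))

# row index i appears freq[i] times, so no surplus rows are created
idx <- rep(seq_len(nrow(df)), times = df$freq)
df.expanded <- df[idx, c("data", "startValue")]

# sequence(c(3, 2, 1)) is 1 2 3 1 2 1: a counter that restarts for each record
df.expanded$value <- floor(df.expanded$startValue) + sequence(df$freq) - 1

rownames(df.expanded) <- NULL
df.expanded
```

Because both rep() and sequence() are vectorised, the result has sum(freq) rows no matter how uneven the frequencies are.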