Repeat dataframe rows based on cumsum index
I have a data frame as follows:
data.frame(title="Title", bk=c("Book 1", "Book 1", "Book 3"), ch=c("Chapter 1", "Chapter 2", "Chapter 1"))
title bk ch
1 Title Book 1 Chapter 1
2 Title Book 1 Chapter 2
3 Title Book 3 Chapter 1
How can I repeat each observation according to the cumsum index below:
id=c(1,1,1,2,2,3,3,3,3)
so that the data frame is expanded to hold the source vector that generated the cumsum index?
title bk ch source_vector
1 Title Book 1 Chapter 1 ...
1 Title Book 1 Chapter 1
1 Title Book 1 Chapter 1
2 Title Book 1 Chapter 2
2 Title Book 1 Chapter 2
3 Title Book 3 Chapter 1
3 Title Book 3 Chapter 1
3 Title Book 3 Chapter 1
3 Title Book 3 Chapter 1
One option is to use separate_rows (here df1 is the data frame with a content column holding the text):
library(tidyverse)
df1 %>%
separate_rows(content)
# title bk ch content
#1 Title Book 1 Chapter 1 This
#2 Title Book 1 Chapter 1 is
#3 Title Book 1 Chapter 1 the
#4 Title Book 1 Chapter 2 content
#5 Title Book 1 Chapter 2 of
#6 Title Book 3 Chapter 1 each
#7 Title Book 3 Chapter 1 chapter
#8 Title Book 3 Chapter 1 in
#9 Title Book 3 Chapter 1 books
If we need to replicate the original rows instead:
df1 %>%
uncount(str_count(content, "\\w+")) %>%
as_tibble()
# A tibble: 9 x 4
# title bk ch content
# <fct> <fct> <fct> <fct>
#1 Title Book 1 Chapter 1 This is the
#2 Title Book 1 Chapter 1 This is the
#3 Title Book 1 Chapter 1 This is the
#4 Title Book 1 Chapter 2 content of
#5 Title Book 1 Chapter 2 content of
#6 Title Book 3 Chapter 1 each chapter in books
#7 Title Book 3 Chapter 1 each chapter in books
#8 Title Book 3 Chapter 1 each chapter in books
#9 Title Book 3 Chapter 1 each chapter in books
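The same repetition can also be done without any packages, by indexing the rows directly with the id vector from the question. The following is a minimal sketch, assuming df1 is the question's data frame plus the content column used in the answers here; the names expanded, n_words and id_from_counts are made up for illustration.

```r
# Assumed reconstruction of the data: the question's columns plus a
# `content` column, with values taken from the base R answer in this thread.
df1 <- data.frame(
  title = "Title",
  bk = c("Book 1", "Book 1", "Book 3"),
  ch = c("Chapter 1", "Chapter 2", "Chapter 1"),
  content = c("This is the", "content of", "each chapter in books"),
  stringsAsFactors = FALSE
)

# The index from the question: row i is listed once per repetition.
id <- c(1, 1, 1, 2, 2, 3, 3, 3, 3)

# Indexing rows by id repeats each row as often as it appears in id.
expanded <- df1[id, ]
rownames(expanded) <- NULL

# The same index can be rebuilt from the number of words in each row.
n_words <- lengths(strsplit(df1$content, " ", fixed = TRUE))
id_from_counts <- rep(seq_len(nrow(df1)), times = n_words)
```

If the id vector already exists, df1[id, ] is likely all that is needed; the rep() call only matters when the index has to be derived from the text.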
In base R you can use do.call with rbind, after doing strsplit and cbind on each row, like this:
x <- data.frame(title="Title", bk=c("Book 1", "Book 1", "Book 3"), ch=c("Chapter 1", "Chapter 2", "Chapter 1"), content=c("This is the", "content of", "each chapter in books"))
do.call("rbind", by(x, 1:nrow(x), function(x) {cbind(x[-ncol(x)], str_split_content=strsplit(as.character(x$content[1]), " ")[[1]])}))
# title bk ch str_split_content
#1.1 Title Book 1 Chapter 1 This
#1.2 Title Book 1 Chapter 1 is
#1.3 Title Book 1 Chapter 1 the
#2.1 Title Book 1 Chapter 2 content
#2.2 Title Book 1 Chapter 2 of
#3.1 Title Book 3 Chapter 1 each
#3.2 Title Book 3 Chapter 1 chapter
#3.3 Title Book 3 Chapter 1 in
#3.4 Title Book 3 Chapter 1 books
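A vectorised variant of the same base R idea avoids the per-row by() and rbind() calls: split all rows at once, repeat the row index by the number of pieces, and attach the unlisted words. This is only a sketch under the same assumptions as the answer above (words separated by single spaces); words, idx and res are illustrative names.

```r
x <- data.frame(
  title = "Title",
  bk = c("Book 1", "Book 1", "Book 3"),
  ch = c("Chapter 1", "Chapter 2", "Chapter 1"),
  content = c("This is the", "content of", "each chapter in books"),
  stringsAsFactors = FALSE
)

# Split every content string on single spaces in one call.
words <- strsplit(as.character(x$content), " ", fixed = TRUE)

# Repeat each row index once per word found in that row.
idx <- rep(seq_len(nrow(x)), times = lengths(words))

# Bind the repeated key columns to the flattened word vector.
res <- cbind(x[idx, c("title", "bk", "ch")], str_split_content = unlist(words))
rownames(res) <- NULL
```

Here idx is the same cumsum-style index as the id vector in the question, so the expansion and the source vector stay aligned row by row.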
If you only want to expand the rows based on the number of words in content, here is one way:
library(splitstackshape)
expandRows(ddf, lengths(gregexpr("\\W+", ddf$content)) + 1, count.is.col = FALSE)
# title bk ch content
#1 Title Book 1 Chapter 1 This is the
#1.1 Title Book 1 Chapter 1 This is the
#1.2 Title Book 1 Chapter 1 This is the
#2 Title Book 1 Chapter 2 content of
#2.1 Title Book 1 Chapter 2 content of
#3 Title Book 3 Chapter 1 each chapter in books
#3.1 Title Book 3 Chapter 1 each chapter in books
#3.2 Title Book 3 Chapter 1 each chapter in books
#3.3 Title Book 3 Chapter 1 each chapter in books
This is closer to what I was looking for:
df %>%
mutate(str_split_content = str_split(content, " ")) %>%
unnest(cols = str_split_content)
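Someone posted this and then revised/removed it a short while ago.
In the original, the str_split was actually on punctuation, so it is not exactly a pure split by word count.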
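To illustrate why the split pattern matters, the number of repeated rows depends on what the text is split on. The strings and the punctuation pattern below are hypothetical, chosen only to show the difference; they are not the original poster's data or pattern.

```r
# Hypothetical text, not taken from the thread.
content <- c("Hello world. Good bye", "no punctuation here")

# Pieces per row when splitting on single spaces (a word count).
by_space <- lengths(strsplit(content, " ", fixed = TRUE))

# Pieces per row when splitting on runs of punctuation plus trailing spaces.
by_punct <- lengths(strsplit(content, "[[:punct:]]+\\s*"))
```

Passing either count vector to rep() or uncount() gives a different expansion, so the pattern should match however the source vector was actually built.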