tidyr separate only first n instances

Question

I have a data.frame in R which, for simplicity, has a single column that I want to separate. It looks like this:

V1
Value_is_the_best_one
This_is_the_prettiest_thing_I've_ever_seen
Here_is_the_next_example_of_what_I_want

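My real data is very large (millions of rows), so I would like to use tidyr's separate function (because it is very fast) to split off only the first few instances of the separator. I would like the result to look like this: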

V1       V2     V3     V4 
Value    is     the    best_one
This     is     the    prettiest_thing_I've_ever_seen
Here     is     the    next_example_of_what_I_want

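As you can see, the separator is _. The V4 column can contain a varying number of separators. I want to keep V4 (rather than discard it), but I don't want to have to care how many pieces are in it. There will always be four columns (i.e. none of my rows has only V1-V3).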

Here is the tidyr command I have been starting from:

separate(df, V1, c("V1", "V2", "V3", "V4"), sep="_")

This discards the remainder that belongs in V4 (and emits a warning, which is the lesser problem).

Answer 1

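You need the extra argument with the "merge" option. It allows only as many splits as there are new columns defined, so everything left over stays in the last column.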

separate(df, V1, c("V1", "V2", "V3", "V4"), extra = "merge")

     V1 V2  V3                             V4
1 Value is the                       best_one
2  This is the prettiest_thing_I've_ever_seen
3  Here is the    next_example_of_what_I_want
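
For anyone who wants to reproduce this, here is a minimal self-contained sketch. The data frame is rebuilt from the rows shown in the question, and passing sep = "_" explicitly is my own addition (the call above relies on the default separator), so treat it as one possible variant rather than the required form.

```r
library(tidyr)

# Sample data rebuilt from the rows shown in the question
df <- data.frame(
  V1 = c("Value_is_the_best_one",
         "This_is_the_prettiest_thing_I've_ever_seen",
         "Here_is_the_next_example_of_what_I_want"),
  stringsAsFactors = FALSE
)

# sep = "_" splits on underscores only (an explicit choice, not in the call above);
# extra = "merge" stops after three splits and keeps the remainder in V4
res <- separate(df, V1, into = c("V1", "V2", "V3", "V4"),
                sep = "_", extra = "merge")
res
```

Being explicit about sep may be worth it if other non-alphanumeric characters can show up in the first few pieces of your real data.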

Answer 2

Here is another option, using extract:

library(tidyr)
extract(df1, V1, into = paste0("V", 1:4), "([^_]+)_([^_]+)_([^_]+)_(.*)")
#      V1 V2  V3                             V4
# 1 Value is the                       best_one
# 2  This is the prettiest_thing_I've_ever_seen
# 3  Here is the    next_example_of_what_I_want

Another option is stri_split from library(stringi), where we can specify the maximum number of pieces with n:

library(stringi)
do.call(rbind, stri_split(df1$V1, fixed="_", n=4))
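
The call above returns a character matrix rather than a data frame. If a data frame with the same V1 to V4 names is wanted, something along these lines should work. This is only a sketch: the vector v1 stands in for df1$V1, and the name out is just an illustrative choice.

```r
library(stringi)

# Stand-in for df1$V1, copied from the question
v1 <- c("Value_is_the_best_one",
        "This_is_the_prettiest_thing_I've_ever_seen",
        "Here_is_the_next_example_of_what_I_want")

# n = 4 caps the number of pieces, so the fourth piece keeps its underscores
parts <- stri_split_fixed(v1, "_", n = 4)

# Stack the pieces row by row, then convert the matrix to a data frame
out <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(out) <- paste0("V", 1:4)
out
```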
