How can I delete irregular chunks of words in R?

Question

Here is a reproducible example.

df2 <- data.frame(Num = c(1,2,3), Comment = c('nick       comment12021.12.01      nickn comment2222021.12.02       nickname333       commennnnt222021.12.01', 'nick       comment12021.12.01      nickn comment2222021.12.02       nickname333       commeeeent222021.12.01','nick       comment12021.12.01      nickn      comment2222021.12.02       nickname3333333       comment22021.12.01') )

Num           Comment
----------------------------------------------------------------------------
1      Tom    comment1~   Jay     comment2     Yun    comment 3 ~
2      Tim    comment1~   Cristal     comment2~      Lomio    comment3~
3      Tracer  comment1~   Teemo   comment2~      Irelia   comment3~
--------------------------------------------------------------------------

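I have a data frame with 2 columns and many rows. These are comments I obtained by crawling a website. However, because it is a very dynamic site, I had no choice but to capture several people's nicknames and comments together.

I want to remove the nicknames from this irregular text and build a word cloud that contains only the comments. But I can't figure out a way to delete only the nicknames. The nicknames and comments vary in length, so I have not been able to do it on my own.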

Answer 1

If you have a fixed delimiter (such as the exactly seven spaces you mentioned in the comments, " {7}" as a regular expression), you can do the following:

dd <- data.frame(
  id = 1:3,
  comment = c(
    "Tom       comment1~       Jay       comment2~       Yun       comment3~",
    "Tim       comment1~       Cristal       comment2~       Lomio       comment3~",
    "Tracer       comment1~       Teemo       comment2~       Irelia       comment3~"
  )
)


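# Split each raw string into alternating nickname / comment fields
# and return one data frame per input row.
extract_comments <- function(comments) {
  lapply(
    comments, 
    function(x) {
      # split on runs of exactly seven spaces
      sp <- strsplit(x, " {7}")[[1]]
      sp <- trimws(sp)
      # odd positions hold nicknames, even positions hold comments
      ppl <- seq(1, length(sp), by = 2)
      data.frame(
        ex_person = sp[ppl],
        ex_comment = sp[ppl + 1]
      )
    }
  )
}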

dd$extracted <- extract_comments(dd$comment)

tidyr::unnest(dd, extracted)
#> # A tibble: 9 x 4
#>      id comment                             ex_person ex_comment
#>   <int> <chr>                               <chr>     <chr>     
#> 1     1 Tom       comment1~       Jay     ~ Tom       comment1~ 
#> 2     1 Tom       comment1~       Jay     ~ Jay       comment2~ 
#> 3     1 Tom       comment1~       Jay     ~ Yun       comment3~ 
#> 4     2 Tim       comment1~       Cristal ~ Tim       comment1~ 
#> 5     2 Tim       comment1~       Cristal ~ Cristal   comment2~ 
#> 6     2 Tim       comment1~       Cristal ~ Lomio     comment3~ 
#> 7     3 Tracer       comment1~       Teemo~ Tracer    comment1~ 
#> 8     3 Tracer       comment1~       Teemo~ Teemo     comment2~ 
#> 9     3 Tracer       comment1~       Teemo~ Irelia    comment3~
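
Since the goal in the question is a word cloud built from the comments alone, the same splitting idea can also return just the comment fields as a plain character vector. The following is only a minimal sketch in base R: comments_only and its sep argument are hypothetical names introduced here, and it assumes that fields strictly alternate nickname, comment, nickname, comment and are separated by exactly seven spaces.

```r
# Hypothetical helper (not part of the answer above): keep only the comment
# fields, assuming the fields alternate nickname, comment, nickname, ...
comments_only <- function(comments, sep = " {7}") {
  unlist(lapply(strsplit(comments, sep), function(sp) {
    sp <- trimws(sp)
    # even positions hold the comments
    sp[seq_along(sp) %% 2 == 0]
  }))
}

# Build sample rows with a separator of exactly seven spaces
sep7 <- strrep(" ", 7)
x <- c(
  paste("Tom", "comment1~", "Jay", "comment2~", sep = sep7),
  paste("Tim", "comment1~", "Cristal", "comment2~", sep = sep7)
)

only <- comments_only(x)
print(only)
```

The resulting vector could then be passed to whichever word cloud tool you use. If the separator is a run of spaces of varying length, a pattern such as " {2,}" could be supplied as sep, but that only works when nicknames and comments never contain two or more consecutive spaces themselves.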


Tags: r, web-crawler, dataframe